"Accuracy" is one of the most over-claimed numbers in retail trading. Telegram channels quote 90%+. Strategy marketplaces show frictionless backtests with hindsight-tuned parameters. The user reads a single number and never sees the assumptions underneath it.

This article explains, in detail, how Stryqe's backtest works and what its accuracy figures actually represent. The honest version is unflattering in places — pooled 4-hour accuracy is currently sitting below the 50% line, and we publish that — but the methodology is the same one the engine uses internally to decide whether tuned weights are good enough to ship. If we're not going to lie to the engine, we're not going to lie to users.

The 60 × 2000 framework

Every six hours, a Cloud Function on the server side fires the backtest. The configuration is fixed:

Parameter        | Value          | Why
Coins per run    | 60             | Top USDT pairs by 24h volume on Binance
Candles per coin | 2000           | ~83 days of 1-hour data — long enough to span multiple regimes
Concurrency      | 3              | Limits exchange API rate-limit exposure
Run cadence      | Every 6h       | 4 runs/day produces enough fresh ALIGNED signals to tune weights
Engine version   | server_full_v2 | Friction-aware, 24h-gated; identical math to the client scanner

The "identical math" claim is enforced, not asserted. The shared indicator module lives at two paths: shared/indicators.js (loaded by the browser) and functions/shared/indicators.js (loaded by the Cloud Function). A SHA-256 parity test runs on every deploy path and fails the deploy if the two files diverge by even a byte. Nobody — including the developer — can ship a server-side change that the client doesn't get.

What the engine does on each run

For each of the 60 coins, the function walks forward through the 2000-candle history. At every candle i from index 26 onward (where 26 is the minimum needed for MACD to compute), it asks: if I were running live right now and saw exactly this much data, would I emit an ALIGNED tag? The signal-generation function used here is byte-identical to the one running on the client, so a backtest signal at candle i is the same signal a live user would have seen at that point in time.
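
In sketch form, the walk-forward pass looks like the loop below. The generateSignal name is a stand-in for the shared signal-generation function, and the candle and signal shapes are assumptions for illustration, not the real data model.

  // Walk-forward sketch. At candle i the engine sees only candles[0..i], never the future.
  const MIN_CANDLES = 26; // the slow EMA of MACD(12, 26, 9) needs 26 candles

  function walkForward(candles, generateSignal) {
    const signals = [];
    for (let i = MIN_CANDLES; i < candles.length; i++) {
      const visible = candles.slice(0, i + 1);  // exactly what a live user would have seen
      const signal = generateSignal(visible);   // byte-identical shared function
      if (signal && signal.tag === 'ALIGNED') {
        signals.push({ ...signal, index: i, entryPrice: candles[i].close });
      }
    }
    return signals;
  }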

Every emitted signal then has its outcome checked against future candles. Two horizons are recorded for every signal: 4 hours ahead and 24 hours ahead. Each outcome captures whether the trade would have closed net-positive after subtracting realistic friction, given the same TP/SL geometry the live scanner uses.

TP and SL — anchored to ATR

The take-profit and stop-loss levels aren't fixed percentages. They scale with each coin's recent volatility, measured by Average True Range (ATR) over a 14-candle window:

Trade geometry (matches live scanner):

  TP = max(0.4, min(6.0, atrPct × 2.5))    // upside target
  SL = −max(0.6, min(4.0, atrPct × 1.5))   // downside stop

  where atrPct = ATR(14) / entryPrice × 100

The hard floors and ceilings (0.4–6.0% on TP, 0.6–4.0% on SL) prevent absurd values when ATR is pathological — a coin in a circuit-breaker halt has near-zero ATR but should not get a 0.01% take-profit. The 2.5× / 1.5× ratio gives roughly a 1.66:1 reward-to-risk geometry, which is mathematically the same on every signal regardless of the coin's volatility regime.
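
Translated directly into code, with two worked examples of the clamping (the ATR(14) computation itself is assumed to live in the shared indicator module and isn't shown):

  // Sketch of the TP/SL geometry above.
  function tradeGeometry(atr14, entryPrice) {
    const atrPct = (atr14 / entryPrice) * 100;
    const tp = Math.max(0.4, Math.min(6.0, atrPct * 2.5));  // upside target, %
    const sl = -Math.max(0.6, Math.min(4.0, atrPct * 1.5)); // downside stop, %
    return { tp, sl };
  }

  // atrPct = 1.2%  -> TP = +3.0%, SL = -1.8%  (the 1.66:1 geometry)
  // atrPct = 0.05% -> TP clamps to +0.4%, SL clamps to -0.6%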

This matters because the live scanner displays exactly the same TP and SL on its signal cards. If a user takes the trade as shown, the backtested accuracy figure represents the same trade. We learned this the hard way — an earlier engine version had inverted geometry (TP=1.0×, SL=1.8× — about 0.55:1 reward-to-risk), which meant the published backtest was measuring a strategy users were never actually trading. The number looked great but was meaningless.

The win condition: net of friction

"Win" is not "price went up." It is "after fees and slippage, the round trip closed positive." The friction model:

For each signal, the engine determines an exit: the TP level if price reached it first, the SL level if the stop was hit first, or the candle close at the lookahead horizon if neither was touched. It then subtracts 0.30 percentage points of friction (round-trip fees plus slippage) from the gross exit percentage, and logs a win only if the result is still positive.
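
A minimal sketch of that outcome check, assuming 1-hour candles with high/low/close fields. The stop-before-TP ordering inside a single candle is a conservative assumption on our part; the helper and its names are illustrative.

  const FRICTION_PCT = 0.30; // round-trip fees plus slippage, in percentage points

  // lookahead is 4 or 24 candles, i.e. the 4-hour and 24-hour horizons.
  function evaluateSignal(candles, signal, lookahead) {
    const { index, entryPrice, tp, sl } = signal;
    const horizon = Math.min(index + lookahead, candles.length - 1);
    let grossPct = null;

    for (let j = index + 1; j <= horizon; j++) {
      const lowPct  = ((candles[j].low  - entryPrice) / entryPrice) * 100;
      const highPct = ((candles[j].high - entryPrice) / entryPrice) * 100;
      if (lowPct <= sl)  { grossPct = sl; break; } // stop hit
      if (highPct >= tp) { grossPct = tp; break; } // take-profit hit
    }
    if (grossPct === null) {
      grossPct = ((candles[horizon].close - entryPrice) / entryPrice) * 100; // exit at horizon close
    }

    const netPct = grossPct - FRICTION_PCT;
    return { netPct, win: netPct > 0 };
  }

  // Each signal is scored twice: evaluateSignal(candles, sig, 4) and evaluateSignal(candles, sig, 24).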

Why friction-aware

A frictionless backtest of any momentum strategy on alt-coin data overstates accuracy by 5–10 percentage points. The published Stryqe number is what you'd actually pocket — not the engine's idealised result.

4-hour vs 24-hour: a structural decision

Every signal records two outcomes. The 4-hour horizon was the original target — fast, scalp-style trades. The 24-hour horizon was added later as a sanity check. The empirical result was inconvenient: they tell completely different stories.

The current pooled baseline (n=96 ALIGNED signals across the rolling 20-run window — see the live accuracy panel for the current numbers) shows:

Horizon  | Wins | Total | Accuracy | vs random floor (37.5%)
4 hours  | 37   | 96    | 38.5%    | At noise floor — no measurable edge
24 hours | 43   | 96    | 44.8%    | ~7 points above floor — modest positive expectancy

Random-walk break-even at the 2.5×ATR / 1.5×ATR geometry is 37.5%: below that rate the strategy loses money over time, while above it the asymmetric payoff turns a sub-50% win rate into positive expectancy. Two takeaways from the table: the engine has a small but measurable edge on the 24-hour horizon, and effectively no edge at 4 hours under realistic friction. The 4-hour outcome is essentially noise for a strategy whose TP averages around 2.5–3% — in 4 hours, alt-coin price action rarely covers 2.5% TP plus 0.3% friction with any consistency. It's not that the engine is bad; it's that the time window is too short for the strategy's typical move size.
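
The 37.5% floor falls straight out of the geometry. For a reward of R and a risk of r, a no-edge strategy breaks even when the expected value of a trade is zero (friction ignored for simplicity):

  p × R = (1 − p) × r   ⇒   p = r / (R + r) = 1.5 / (2.5 + 1.5) = 37.5%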

The honest action this implies — and the one taken — is: tune the engine on the 24-hour axis (where the edge is), gate adoption of new weights on 24-hour outcomes, and report the 4-hour number alongside without pretending it's good.

The never-worse guard

After each run, the engine proposes weight adjustments based on per-indicator hit rates. Then comes the gate that decides whether the proposal ships:

  1. Compute the actual 24-hour accuracy under the old weights on this run's signals.
  2. Re-score every signal in this run using the new weights, re-applying the 75-net-score and 3-bull-count thresholds. This produces a different signal set.
  3. Compute the simulated 24-hour accuracy under the new weights.
  4. If newAcc >= oldAcc, adopt the new weights. Otherwise, keep the old weights and discard the proposal.

This guard exists because a single run with unusual market conditions can produce weight adjustments that look reasonable in isolation but degrade overall accuracy. The guard rejects them silently. Over weeks, this filters proposed changes down to only those that survive on past data — biased, of course, toward overfitting recent regimes, but better than no filter at all.
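
In sketch form, with rescoreSignals and accuracy24h standing in for helpers that aren't shown here:

  // Never-worse guard sketch.
  function maybeAdoptWeights(runSignals, oldWeights, proposedWeights) {
    const oldAcc   = accuracy24h(runSignals);                     // outcomes logged under the old weights
    const rescored = rescoreSignals(runSignals, proposedWeights); // re-apply the 75 net-score / 3 bull-count thresholds
    const newAcc   = accuracy24h(rescored);
    return newAcc >= oldAcc ? proposedWeights : oldWeights;       // ties go to the new weights
  }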

Adoption gate: pooled n ≥ 30 and 24h accuracy ≥ 60%

Even when weights change on the server, the client doesn't blindly adopt them. There's a separate gate inside the live scanner that runs every time it boots: server-tuned weights are adopted only if the rolling 20-run pool has at least 30 ALIGNED signals AND the pooled 24-hour accuracy is at least 60%, with the pooled runs tagged as coming from the friction-aware v2 engine. If any of those conditions fails, the live scanner falls back to whatever weights it has cached from the most recent successful adoption — or, on first run, the deterministic defaults shipped with the build.
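
The shape of that check is simple; the field names below are illustrative rather than copied from the codebase.

  // Client-side adoption gate sketch.
  function shouldAdoptServerWeights(pooled) {
    return (
      pooled.engineVersion === 'server_full_v2' && // friction-aware v2 runs only
      pooled.alignedSignals >= 30 &&               // pooled n over the rolling 20-run window
      pooled.accuracy24h >= 0.60                   // pooled 24-hour accuracy gate
    );
  }
  // false => keep cached weights from the last successful adoption,
  // or the shipped deterministic defaults on first run.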

The current live state: the pooled 24-hour accuracy is sitting around the mid-forties, which is above the random-walk floor but well short of the 60% gate. So the gate is closed: the tuned weights exist on the server and are written to Firestore on every cron run, but they are not adopted by the client. This is intentional. A 60% gate that hasn't opened means the engine doesn't yet have evidence of an edge strong enough to override the conservative defaults. It's a feature, not a bug.

It's worth being precise about what the score-based engine actually controls on the live scanner. The CII percentage and the score-based action tag use the tuned weights once adoption clears. The "Aligned" filter tab on the scanner — the one most users actually trade off — uses a separate rule-based classifier that emits ALIGNED whenever the deterministic conditions (above 200 EMA, daily trend confirmed, MACD crossover, three-of-twelve indicators flashing buy) all line up. Those rules don't read the tuned weights. So even when the gate clears, the score and the rule-based ALIGNED tag are partially independent — a deliberate design that prevents weight tuning from quietly shifting the user-facing primary signal.
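
Stripped to its shape, the rule-based classifier is a handful of booleans and a counter; the input names here are illustrative stand-ins for values computed from the shared indicator module.

  // Rule-based ALIGNED classifier sketch. No weights appear anywhere in it,
  // so weight tuning cannot move the user-facing primary signal.
  function isAligned({ above200Ema, dailyTrendConfirmed, macdCrossover, bullishIndicatorCount }) {
    return (
      above200Ema &&
      dailyTrendConfirmed &&
      macdCrossover &&
      bullishIndicatorCount >= 3 // three of the twelve indicators flashing buy
    );
  }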

The instrumentation that lets us measure both classifiers in parallel — same candles, same outcomes — is recent and we don't yet have enough data to compare them honestly. Once both pools reach n ≥ 50 we'll publish the comparison and let the data, not theory, decide whether to consolidate.

Pooled vs per-run accuracy

An individual 6-hour run might produce only 2–4 ALIGNED signals. Computing accuracy from a 4-signal sample is statistically meaningless — one win or loss swings the percentage by 25 points. Stryqe stores the raw win counts (not just rounded percentages) for every run, so the dashboard can pool the last 20 runs and compute a single accuracy from the cumulative wins/total. Pooling is what makes the figure trustworthy: with n=96 it's based on roughly 43 wins out of 96 trades on the 24-hour horizon, not on the average of percentage-points across runs (which would weight a 1/1 run equally with a 5/10 run). Even at this sample size the 95% confidence interval on the 24-hour rate is roughly ±10 percentage points, which is why we don't claim certainty about the edge — only that it is consistently above the random-walk floor across the rolling window.
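
Pooling is a few lines once raw counts are stored per run. The normal-approximation interval below is where the rough ±10-point figure comes from: at n = 96 and p ≈ 0.45, the half-width is about 1.96 × sqrt(0.45 × 0.55 / 96) ≈ 0.10. The run shape is assumed for illustration.

  // Pool raw win counts across the rolling window instead of averaging per-run percentages.
  function pooledAccuracy(runs) { // runs: [{ wins24h, total24h }, ...]
    const wins  = runs.reduce((sum, r) => sum + r.wins24h, 0);
    const total = runs.reduce((sum, r) => sum + r.total24h, 0);
    const p = wins / total;
    const halfWidth = 1.96 * Math.sqrt((p * (1 - p)) / total); // 95% normal-approximation CI
    return { accuracy: p, n: total, ci95: [p - halfWidth, p + halfWidth] };
  }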

What the backtest doesn't tell you

Limits we're upfront about

The backtest cannot measure: regime changes that lie outside the 83-day window, the impact of news events on individual coins, the effect of correlated positions on portfolio risk, partial fills or exchange outages, or the user's actual emotional response to a string of losses. A small positive expectancy doesn't help you if you stop trading after the first three-loss streak — and at sub-50% accuracy, three-loss streaks are mathematically common.

The 2000-candle window is also a deliberate trade-off: longer histories include older market regimes that may not represent current conditions, but shorter histories produce statistically weaker estimates. 2000 hourly candles ≈ 83 days, which spans typical multi-week swings without dragging in regimes from a year ago.

How to read the accuracy dashboard

The accuracy panel in the app shows the pooled 4h accuracy, the pooled 24h accuracy, the current sample size (n), the engine version tag, and whether the adoption gate has cleared. It updates every 6 hours after the cron run completes.

The single number to watch is the 24-hour pooled accuracy. That is the axis the engine ships on, the axis it tunes on, and the axis that determines whether the strategy has an edge after costs. The 4-hour number is shown for transparency but is not the engine's claim.

The bottom line

Backtesting is hard to do honestly because every methodological shortcut inflates the number. We've tried to make every shortcut explicit: the friction is realistic, the TP/SL geometry mirrors the live scanner exactly, the parity test prevents server/client drift, the never-worse guard rejects bad tunings, and we publish the unflattering 4-hour number alongside the favourable 24-hour one.

The result is a methodology where the published accuracy is consistently lower than competitors quote — and consistently closer to what users actually experience. We think that trade-off is worth it.