Methodology & Transparency
How Strategy Arena's ML and statistical layers actually work. We measured every Brier score. We fixed every leak. Here's the real architecture.
Anti-marketing: if a layer is analytics, we call it analytics. If it is rules-based, we call it rules-based. If it is ML, we publish the real measured Brier.
New: Strategy Hospital publishes live strategy triage: healthy, drift, bug suspect, idle, or deprecated.
The 5 monsters: what they actually are
| Monster | Architecture | Real metric | Status |
|---|---|---|---|
| Invictus ML Ultimate | LightGBM with isotonic calibration, OOF validation and monotonic constraints | Brier OOS expected ~0.22 (calibrated) | Real ML; audited 8/10 by DeepSeek |
| Chimera Scanner + CNN | 17 statistical patterns + PyTorch CNN, 108 OHLCV/pattern channels | Brier OOS 0.2512 (9,356 samples) | Hybrid: rules + real ML |
| Leviathan 9-Layer Ensemble | 8 heuristic layers + 1 PyTorch MLP as Layer 9 | Brier OOS 0.2589 (10,758 samples, post-leak-fix) | Hybrid: heuristics + real ML |
| Hydra ML V5 + LSTM | XGBoost ranking for PnL + PyTorch LSTM for direction | Brier OOS 0.2480 (51,718 samples) | Real dual ML |
| Meta Intelligence v3 | Strategy analytics: bootstrap CI, Bonferroni multi-compare, performance snapshots | No prediction (analytics engine) | Honest dashboard |
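Brier > 0.25 = barely usable. Lower is better, and a constant 50% forecast scores exactly 0.25 on a binary up/down target, so 0.25 is the no-skill reference. For 5-minute crypto direction prediction it is also close to the practical limit. Our 3 ML models sit right around this level: publishable, but not magic.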
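For reference, the Brier score is the mean squared difference between a forecast probability and the 0/1 outcome. The sketch below is a minimal illustration on made-up outcomes, not the project's evaluation code; `brier_score` and the toy data are hypothetical names chosen for the example.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Hypothetical toy data: 1 = price closed higher on the next bar, 0 = it did not.
outcomes = [1, 0, 1, 1, 0, 0, 1, 0]

# A forecaster that always says 50% has no skill on a binary target.
coin_flip = brier_score([0.5] * len(outcomes), outcomes)

# A forecaster that leans 55/45 in the right direction every time.
slight_edge = brier_score([0.55 if y else 0.45 for y in outcomes], outcomes)

print(coin_flip)
print(slight_edge)
```

Even a forecaster that leans the right way on every single bar only moves the score from 0.25 to about 0.20, which is why small differences in the third decimal place matter in this setting.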
The methodology we use to validate strategies
Monte Carlo CV
30 random temporal splits, anchor between 20% and 70%.
Robustness gate
Sharpe_p5 > 0.5, where Sharpe_p5 is the 5th percentile of the Sharpe ratio across the 30 splits.
Trade count
n_trades_mean > 20 per OOS window, with at least 10 valid splits.
- Fees included: 0.20% round-trip.
- Single-split validations are treated as weak until they survive MC CV.
- Example: Wyckoff Evolved had OOS Sharpe 1.85 on a single split, then MC mean Sharpe 0.73 on PUMP, -0.04 on INJ, -0.36 on FLOKI. We rejected it.
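The gate described above can be sketched roughly as follows. This is an illustrative reading, not the actual Strategy Arena implementation. The function names, the per-trade (non-annualized) Sharpe, the choice to treat every trade at or after the anchor as the OOS window, and the definition of a "valid" split as one where Sharpe is computable are all assumptions made for the example.

```python
import random
import statistics

FEE_ROUND_TRIP = 0.0020  # 0.20% round-trip, as stated in the text


def sharpe(returns):
    """Per-trade Sharpe (mean / sample std), not annualized. None if undefined."""
    if len(returns) < 2:
        return None
    sd = statistics.stdev(returns)
    if sd == 0:
        return None
    return statistics.mean(returns) / sd


def mc_cv_gate(trades, n_bars, n_splits=30, seed=0):
    """trades: list of (bar_index, gross_return), sorted by bar_index.

    Each split draws an anchor between 20% and 70% of the history and
    scores only the trades at or after it (assumed OOS window).
    """
    rng = random.Random(seed)
    sharpes, counts = [], []
    for _ in range(n_splits):
        anchor = int(rng.uniform(0.20, 0.70) * n_bars)
        oos = [r - FEE_ROUND_TRIP for bar, r in trades if bar >= anchor]
        s = sharpe(oos)
        if s is not None:  # assumed meaning of a "valid" split
            sharpes.append(s)
            counts.append(len(oos))
    if len(sharpes) < 10:
        return {"passed": False, "valid_splits": len(sharpes)}
    # quantiles(n=20) returns 19 cut points; the first is the 5th percentile.
    sharpe_p5 = statistics.quantiles(sharpes, n=20, method="inclusive")[0]
    n_trades_mean = statistics.mean(counts)
    return {
        "passed": sharpe_p5 > 0.5 and n_trades_mean > 20,
        "valid_splits": len(sharpes),
        "sharpe_p5": sharpe_p5,
        "n_trades_mean": n_trades_mean,
    }


def synthetic_trades(mean, n_trades=400, spacing=10, seed=1):
    """Hypothetical data: Gaussian gross returns, one trade every `spacing` bars."""
    rng = random.Random(seed)
    return [(i * spacing, rng.gauss(mean, 0.01)) for i in range(n_trades)]


strong = mc_cv_gate(synthetic_trades(0.02), n_bars=4000)
weak = mc_cv_gate(synthetic_trades(0.0), n_bars=4000)
print(strong["passed"], weak["passed"])
```

Gating on the 5th percentile rather than the mean is what rejects strategies of the Wyckoff Evolved kind: one lucky split can lift the mean, but it cannot lift the worst splits.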
Strategies validated by Monte Carlo
| Strategy | Validated assets | Best Sharpe_p5 | Rejected on |
|---|---|---|---|
| Smart Money Evolved | BTC, ETH, SOL, BNB | 1.22 (BTC) | - |
| Mean Rev Pro Evolved | NEAR, SNX, CHZ, TIA | 1.189 (SNX) | TRB |
| Capitulation Rebound Evolved | BTC, SOL, BNB, NEAR, SNX, CHZ, TIA | 1.526 (SNX) | - |
| Deep Freeze Evolved | SNX, CHZ | 0.884 (CHZ) | BTC, ETH, SOL, BNB, NEAR, TIA, AVAX |
| Sly Fox Evolved | BNB | 0.599 | 8 others |
| Deep Shadow Evolved | BTC | 0.851 | 8 others |
| Wyckoff Evolved | none | - | PUMP, INJ, COMP, FLOKI |
| Darvas | none | - | BTC, ETH, SOL, BNB, TRB |
MC validations are now tracked live, cell by cell, to measure drift between theoretical Sharpe_p5 and real performance.
View live Monte Carlo results
Data leaks we fixed
chimera_ml.py
Target leakage: avg_pnl was both feature and label source. Deleted on 2026-05-15.
leviathan_data_merger.py
3 look-ahead bugs: future news, regime using current bar, future one-hot. Fixed on 2026-05-15.
Honest consequence: Leviathan NN's Brier moved from 0.244 with leakage to 0.2589 without leakage. We publish the real number.
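A look-ahead bug of the "regime using current bar" kind can be caught with a tamper test: alter every bar from t onward and check that the feature attached to bar t does not move. The sketch below is a generic illustration, not code from `leviathan_data_merger.py`. The function names and the convention that a feature for bar t may only use bars strictly before t are assumptions.

```python
def rolling_mean_leaky(closes, window=3):
    """Value for bar t includes closes[t], which is unknown when bar t opens."""
    out = []
    for t in range(len(closes)):
        chunk = closes[max(0, t - window + 1):t + 1]  # includes closes[t]
        out.append(sum(chunk) / len(chunk))
    return out


def rolling_mean_lagged(closes, window=3):
    """Value for bar t uses only closes[t-window:t]; None until history is long enough."""
    return [
        sum(closes[t - window:t]) / window if t >= window else None
        for t in range(len(closes))
    ]


def uses_future(feature_fn, closes, t):
    """True if the feature at bar t changes when bars t and later are altered."""
    tampered = closes[:t] + [c * 10 for c in closes[t:]]
    return feature_fn(closes)[t] != feature_fn(tampered)[t]


# Hypothetical closing prices.
closes = [100.0, 101.0, 99.0, 102.0, 103.0, 101.0]

print(uses_future(rolling_mean_leaky, closes, 4))
print(uses_future(rolling_mean_lagged, closes, 4))
```

Running a check like this over every feature column is cheap and flags the leaky version immediately, while the lagged version passes.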
Why some "AI strategies" are not real ML
- Leviathan 9-Layer Ensemble Brain used to be 8 heuristic layers and storytelling. After the graft, it is a 9-Layer Ensemble: 8 heuristics + 1 PyTorch MLP.
- The old Chimera total was an exaggerated count taken from a live-accumulated brain JSON. We now display 50 peer-reviewed patterns, filtered with Benjamini-Hochberg FDR at alpha 0.05.
- ML Arena V3 used to be isolated from the main monsters. The models were migrated in-place into backend/: chimera_cnn.py, leviathan_nn.py, hydra_lstm.py.
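For the pattern filtering mentioned in the list above, and assuming it refers to the standard Benjamini-Hochberg step-up procedure for false discovery rate control, a minimal sketch looks like this. The function name and the p-values are hypothetical and are not Chimera's actual pattern statistics.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest 1-based rank k such that p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank

    # Reject every hypothesis ranked at or below k_max.
    keep = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            keep[idx] = True
    return keep


# Hypothetical p-values for 10 candidate patterns.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

keep = benjamini_hochberg(p_values, alpha=0.05)
print([p for p, k in zip(p_values, keep) if k])
```

On these numbers a plain Bonferroni cut at alpha / m = 0.005 would keep only the first pattern, while the step-up procedure keeps the first two. That difference is the usual reason to prefer FDR control when screening many candidate patterns.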
What we are not claiming
We are not claiming to reliably predict crypto direction.
We are not claiming Brier < 0.20. A score that low would be suspicious for this problem framing.
We are not claiming Sharpe ratios above the 1-3 range without long validation.
We are not claiming a single magic unified AI brain.
What we claim: a transparent lab that measures everything, publicly fixes leaks, and refuses to publish as "edge" what does not survive strict Monte Carlo CV validation.