Methodology & Transparency
| Domain | Evidence | Gate | Status |
|---|---|---|---|
| Live strategies | arena_state_*_v5.json | Public count | 86 |
| ML / Brier | brier_autopsy_*.json | anti-leak OOF | published |
| Monte Carlo CV | live_mc_results_snapshot.json | Sharpe_p5 | 7 |
| Strategy Hospital | JSON | triage | PASS/WATCHLIST |
| Mode | paper trading | No real orders | paper-trading |
How Strategy Arena's ML and statistical layers actually work. We measured every Brier score. We fixed every leak we found. Here's the real architecture.
The 5 monsters: what they actually are
| Monster | Architecture | Real metric | Status |
|---|---|---|---|
| Invictus ML Ultimate | LightGBM with isotonic calibration, OOF validation and monotonic constraints | Brier OOS expected ~0.22 (calibrated) | Real ML; audited 8/10 by DeepSeek |
| Chimera Scanner + CNN | 17 statistical patterns + PyTorch CNN, 108 OHLCV/pattern channels | Brier OOS 0.2512 (9,356 samples) | Hybrid (rules + real ML) |
| Leviathan 9-Layer Ensemble | 8 heuristic layers + 1 PyTorch MLP as Layer 9 | Brier OOS 0.2589 (10,758 samples, post-leak-fix) | Hybrid (heuristics + real ML) |
| Hydra ML V5 + LSTM | XGBoost ranking for PnL + PyTorch LSTM for direction | Brier OOS 0.2480 (51,718 samples) | Real dual ML |
| Meta Intelligence v3 | Strategy analytics: bootstrap CI, Bonferroni multi-compare, performance snapshots | No prediction (analytics engine) | Honest dashboard |
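The Meta Intelligence row mentions bootstrap confidence intervals and Bonferroni multi-comparison. As a rough illustration only (not the production code), here is a percentile bootstrap for the mean per-trade return, with the significance level divided by the number of strategies compared. The names `bootstrap_ci` and `bonferroni_alpha`, the resample count and the sample returns are all assumptions made for this sketch.

```python
import random
import statistics

def bootstrap_ci(sample, alpha=0.05, n_resamples=2000, rng=None):
    """Percentile-bootstrap confidence interval for the mean of `sample`."""
    rng = rng or random.Random(0)
    n = len(sample)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    lo_idx = max(0, int(alpha / 2 * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]

def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison level that keeps the family-wise error rate at or below alpha."""
    return alpha / n_comparisons

# Hypothetical per-trade returns for one strategy, compared against 4 others.
returns = [0.012, -0.004, 0.008, -0.010, 0.015, 0.003, -0.002, 0.007]
alpha = bonferroni_alpha(0.05, n_comparisons=5)
low, high = bootstrap_ci(returns, alpha=alpha)
print(f"{(1 - alpha):.0%} CI for mean return: [{low:.4f}, {high:.4f}]")
```

Dividing alpha by the number of comparisons widens each interval, which is the price paid for looking at several strategies at once.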
Brier 0.25: what it means
5m directional models plateau around 0.25. That is a measured limit, not an ML victory claim.
The real edge gate remains Monte Carlo CV: Sharpe_p5, fees, embargo, and live cell-by-cell tracking.
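For reference, the Brier score of a binary forecast is the mean squared difference between the predicted probability and the 0/1 outcome. The minimal sketch below shows why 0.25 is the no-information baseline: a constant 0.5 forecast scores exactly 0.25 whatever the outcomes are. The function name and the sample labels are illustrative, not taken from the Strategy Arena codebase.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

outcomes = [1, 0, 0, 1, 1, 0]            # hypothetical up/down labels
coin_flip = [0.5] * len(outcomes)        # forecast that carries no information
print(brier_score(coin_flip, outcomes))  # 0.25
```

A model reporting a Brier score near 0.25 on balanced up/down labels is therefore close to this baseline, which is why the score is not used as a standalone edge claim.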
The methodology we use to validate strategies
- 30 random temporal splits, anchor between 20% and 70%.
- Sharpe_p5 > 0.5, where Sharpe_p5 is the 5th percentile of the Sharpe ratios across the 30 splits.
- n_trades_mean > 20 per OOS window, with at least 10 valid splits.
- Fees included: 0.20% round-trip.
- Single-split validations are treated as weak until they survive MC CV.
- Example: Wyckoff Evolved had OOS Sharpe 1.85 on a single split, then MC mean Sharpe 0.73 on PUMP, -0.04 on INJ, -0.36 on FLOKI. We rejected it.
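The gate described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the production validator: the OOS window length (`oos_frac`), the unannualized per-trade Sharpe, charging the fee once per trade, and all function names are hypothetical, and the embargo step is omitted for brevity.

```python
import random
import statistics

FEE_ROUND_TRIP = 0.0020    # 0.20% round-trip, as stated above
N_SPLITS = 30
SHARPE_P5_GATE = 0.5
MIN_TRADES_MEAN = 20
MIN_VALID_SPLITS = 10

def percentile(values, q):
    """Percentile with linear interpolation between order statistics (q in 0..100)."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def sharpe(returns):
    """Per-trade mean/stdev ratio (unannualized: an assumption). None if undefined."""
    if len(returns) < 2:
        return None
    sd = statistics.stdev(returns)
    return statistics.fmean(returns) / sd if sd > 0 else None

def mc_cv_gate(trades, n_bars, rng, oos_frac=0.30):
    """trades: list of (bar_index, gross_return). oos_frac is a hypothetical choice."""
    sharpes, counts = [], []
    for _ in range(N_SPLITS):
        anchor = int(n_bars * rng.uniform(0.20, 0.70))   # random temporal anchor
        end = min(n_bars, anchor + int(n_bars * oos_frac))
        # Keep only trades inside the OOS window and subtract fees.
        oos = [r - FEE_ROUND_TRIP for i, r in trades if anchor <= i < end]
        s = sharpe(oos)
        if s is not None:                                # split counts as valid
            sharpes.append(s)
            counts.append(len(oos))
    valid = len(sharpes)
    p5 = percentile(sharpes, 5) if valid else None
    n_mean = statistics.fmean(counts) if valid else 0.0
    passed = (valid >= MIN_VALID_SPLITS and p5 > SHARPE_P5_GATE
              and n_mean > MIN_TRADES_MEAN)
    return {"passed": passed, "sharpe_p5": p5,
            "n_trades_mean": n_mean, "valid_splits": valid}
```

Gating on the 5th percentile rather than the mean is what penalizes a strategy that looks strong on one lucky split: the verdict depends on its worst plausible windows.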
Strategies validated by Monte Carlo
| Strategy | Validated assets | Best Sharpe_p5 | Rejected on |
|---|---|---|---|
| Smart Money Evolved | BTC, ETH, SOL, BNB | 1.22 (BTC) | - |
| Mean Rev Pro Evolved | NEAR, SNX, CHZ, TIA | 1.189 (SNX) | TRB |
| Capitulation Rebound Evolved | BTC, SOL, BNB, NEAR, SNX, CHZ, TIA | 1.526 (SNX) | - |
| Deep Freeze Evolved | SNX, CHZ | 0.884 (CHZ) | BTC, ETH, SOL, BNB, NEAR, TIA, AVAX |
| Sly Fox Evolved | BNB | 0.599 | 8 others |
| Deep Shadow Evolved | BTC | 0.851 | 8 others |
| Wyckoff Evolved | none | - | PUMP, INJ, COMP, FLOKI |
| Darvas | none | - | BTC, ETH, SOL, BNB, TRB |
Data leaks we fixed
- Target leakage: avg_pnl was both a feature and the label source. Removed on 2026-05-15.
- 3 look-ahead bugs: future news, regime using current bar, future one-hot. Fixed on 2026-05-15.
Honest consequence: Leviathan NN's Brier moved from 0.244 with leakage to 0.2589 without leakage. We publish the real number.
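To make the "regime using current bar" class of bug concrete, here is a small sketch of a regime flag that reads only bars strictly before the bar being predicted. The function name, the window length and the flag definition are assumptions for illustration; the actual regime feature may be defined differently.

```python
import statistics

def regime_flags(closes, window=3):
    """1 if the last closed bar is above its trailing mean, else 0.

    The flag for bar t uses closes[:t] only, so bar t never sees itself.
    """
    flags = []
    for t in range(len(closes)):
        history = closes[:t]              # strictly before bar t
        if len(history) < window:
            flags.append(None)            # not enough past data yet
            continue
        sma = statistics.fmean(history[-window:])
        flags.append(1 if history[-1] > sma else 0)
    return flags

# Changing the current (last) bar must not change its own feature value.
a = regime_flags([1, 2, 3, 4, 100])
b = regime_flags([1, 2, 3, 4, -100])
print(a == b)  # True
```

The same invariance check (perturb bar t, confirm the feature at t is unchanged) works as a cheap regression test for any feature suspected of look-ahead.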
Why some "AI strategies" are not real ML
- Leviathan 9-Layer Ensemble Brain used to be 8 heuristic layers and storytelling. After the graft, it is a 9-Layer Ensemble: 8 heuristics + 1 PyTorch MLP.
- The old Chimera total was an exaggerated count from a live-accumulated brain JSON. We now display 50 peer-reviewed patterns, filtered with Benjamini-Hochberg FDR at alpha 0.05.
- ML Arena V3 used to be isolated from the main monsters. The models were migrated in-place into backend/: chimera_cnn.py, leviathan_nn.py, hydra_lstm.py.
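An FDR filter like the one mentioned for Chimera is most commonly implemented with the Benjamini-Hochberg step-up procedure. Assuming that is the method in use, a minimal version looks like this. The function name and the p-values are hypothetical, not real pattern statistics.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= alpha * k / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max (the step-up rule).
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected

# Hypothetical p-values for 10 candidate patterns.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p))  # only the first two survive
```

Unlike a Bonferroni correction, which controls the chance of any false positive, this procedure controls the expected share of false positives among the patterns that are kept.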
What we are not claiming
Newsjacker editorial process
Newsjacker connects one financial, crypto or AI news source to one precise Strategy Arena finding each day. It is assisted editorial infrastructure, not a marketing-content machine: original source link required, internal link to a measured finding required, caveat required, and owner review queue by default.
Distributed Research Network
Strategy Arena is moving toward a quantitative citizen-science layer: local personal backtests, public hardware benchmarks, then cooperative themed raids. The important constraint: every raid must start from a pre-registered hypothesis, with quorum, replication and publication of positive and negative results.
- Local personal backtest. Client-side compute, optional save when logged in.
- Standardized GPU contest: speed, stability, reproducibility, public leaderboard.
- Research raids: 10-50 GPUs validating one hypothesis and producing an open paper.
Known limits
- Brier ~0.25 on 5m crypto: practical ceiling, not standalone directional edge.
- Monte Carlo CV does not guarantee future live performance.
- Metals feeds: MC re-validation after Yahoo GC=F/SI=F migration.
- Heterogeneous AI layers (heuristics + ML) explicitly labeled.
- /facts/*.json snapshots = VPS state at generation, not HFT feed.