Monte Carlo backtesting
How to estimate whether a crypto backtest is robust or overfit — bootstrap, low percentiles, calibration, and walk-forward drift, published in an open lab.
Strategy Arena Monte Carlo framework (5 steps)
Each step produces a verifiable public artifact — not just a Sharpe on one curve.
| Step | Component | Input | Output / gate | Proof |
|---|---|---|---|---|
| 1. Bootstrap | Trade / return resampling | Walk-forward backtest trade series | 1000 simulated PnL paths | monte-carlo.json |
| 2. Percentiles | p5 / p50 / p95 PnL & Sharpe | Bootstrap distribution | p5 PnL > 0 required | /facts/monte-carlo |
| 3. Robustness | Score 0–1 (subsample stability) | Inter-sim variance | Robustness > 0.6 | /live-results |
| 4. Calibration | Brier / reliability | Model probs vs outcomes | Published even when bad | /facts/ml-edge |
| 5. Drift | Walk-forward vs live paper | 5m paper equity | Alert if gap > threshold | /dashboard, /strategy-hospital |
The bootstrap assumes trades are conditionally exchangeable; this limitation (autocorrelation, regimes) is documented on /methodology.
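Steps 1 and 2 of the table can be illustrated with a minimal sketch. It assumes the input is a flat list of per-trade net PnL values and only tracks the terminal PnL of each simulated path. The trade values and the names `bootstrap_total_pnl` and `percentile` are hypothetical, not the Strategy Arena implementation; only the 1000-simulation count and the `p5 > 0` gate come from the table.

```python
import random

def bootstrap_total_pnl(trade_pnls, n_sims=1000, seed=0):
    """Resample trades with replacement; return one total PnL per simulated path."""
    rng = random.Random(seed)
    n = len(trade_pnls)
    return [sum(rng.choices(trade_pnls, k=n)) for _ in range(n_sims)]

def percentile(values, q):
    """Linear-interpolation percentile, with q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

# Hypothetical per-trade net PnL (after fees and slippage), 30 trades.
trades = [12.0, -8.0, 5.5, -3.0, 20.0, -10.0, 7.0, 1.5, -4.0, 9.0] * 3

totals = bootstrap_total_pnl(trades)
p5, p50, p95 = (percentile(totals, q) for q in (5, 50, 95))
passes_gate = p5 > 0  # gate from the table: p5 PnL must be positive
print(f"p5={p5:.1f} p50={p50:.1f} p95={p95:.1f} gate={'PASS' if passes_gate else 'FAIL'}")
```

Because this resampling is i.i.d., it inherits the exchangeability assumption noted above: any serial dependence between trades is destroyed by the shuffle.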
Five Monte Carlo pitfalls & StrategyArena fixes
- Too few trades: 100 simulations on 8 trades is pure noise. Fix: require at least 30 walk-forward trades before MC; otherwise the strategy is flagged WATCH in Hospital.
- Ignoring the low percentile (p5): a great median can hide a catastrophic left tail. Fix: p5 PnL is published; the gate p5 ≤ 0 → RECALIBRATE / BUG_SUSPECT.
- i.i.d. bootstrap on autocorrelated returns: overstates confidence. Fix: experimental block bootstrap plus mandatory walk-forward in the pipeline.
- MC without fees / slippage: inflated percentiles. Fix: apply the same friction model as the backtest (see /methodology).
- Single MC pass as a checkbox exercise: no drift monitoring. Fix: monthly re-MC plus live paper comparison; snapshots in monte-carlo.json.
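The first and third fixes in the list above can be sketched together. This is a minimal moving-block bootstrap, one common way to preserve short-range autocorrelation; the source only says its block bootstrap is experimental and does not specify the variant. The block length of 5 and the function names are assumptions. The 30-trade minimum is taken from the list.

```python
import random

MIN_TRADES = 30  # minimum walk-forward trades before MC (from the list above)

def block_bootstrap(series, block_len, rng):
    """Moving-block bootstrap: stitch random contiguous blocks up to the original length."""
    n = len(series)
    if not 1 <= block_len <= n:
        raise ValueError("block_len must be between 1 and len(series)")
    out = []
    while len(out) < n:
        start = rng.randrange(n - block_len + 1)  # block fits entirely inside the series
        out.extend(series[start:start + block_len])
    return out[:n]

def mc_or_watch(trade_pnls, n_sims=1000, block_len=5, seed=0):
    """Return simulated total PnLs, or None when the sample is too small for MC."""
    if len(trade_pnls) < MIN_TRADES:
        return None  # the caller would flag the strategy as WATCH
    rng = random.Random(seed)
    return [sum(block_bootstrap(trade_pnls, block_len, rng)) for _ in range(n_sims)]

# Hypothetical 8-trade sample: too small, so no simulation is run.
too_few = mc_or_watch([5.0, -2.0, 3.0, 1.0, -4.0, 6.0, 2.0, -1.0])
print("WATCH" if too_few is None else "MC ran")
```

Setting `block_len=1` reduces this to the plain i.i.d. bootstrap, which makes it easy to compare the two percentile sets on the same trade series.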
Live Monte Carlo stats (updated: 2026-05-24)
Counts synced with strategy-arena.json when available; per-strategy MC detail in monte-carlo.json.
Researcher workflow
backtest → bootstrap (1000) → percentiles → robustness → calibration → drift check → hospital
Reproducibility: export trades from /backtest, compare them to the public JSON fields, then read the Hospital status. For aggregated rules-based allocation, see /atlas-edge-allocator (still paper trading only).
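The calibration and drift steps of the workflow can be sketched as two small checks. The Brier score below is the standard mean squared error between predicted probabilities and binary outcomes. The drift check assumes a simple absolute gap between backtest and paper returns; the source does not state its gap metric or threshold, so `drift_alert` and the 0.05 threshold are hypothetical, as are all sample values.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def drift_alert(backtest_return, paper_return, threshold):
    """True when the absolute backtest-vs-paper gap exceeds the threshold."""
    return abs(backtest_return - paper_return) > threshold

# Hypothetical values for illustration only.
probs = [0.8, 0.6, 0.3, 0.9]   # model probability that each trade wins
outcomes = [1, 0, 0, 1]        # realized result: 1 = win, 0 = loss
print("Brier:", round(brier_score(probs, outcomes), 4))
print("Drift alert:", drift_alert(0.12, 0.03, threshold=0.05))
```

A Brier score of 0 is perfect and 0.25 matches always predicting 0.5, so lower is better.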
Monte Carlo FAQ
- Why 1000 simulations?
  - It is a tradeoff between compute latency and percentile stability, documented on /methodology. Raising N reduces Monte Carlo noise, not market risk.
- Does MC replace paper trading?
  - No. MC tests the historical distribution; paper trading tests live execution on 5m OHLCV (drift, bugs, data latency).
- Where are MC failures visible?
  - Hospital (WATCH / RECALIBRATE), Research, and DEPRECATED strategies; see /trading-strategy-validation.
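The first answer can be checked empirically with a minimal sketch: repeat the bootstrap several times at a small and a large N and compare how much the p5 estimate moves between runs. The trade sample is synthetic, and the function names, the repeat count, and the simple order-statistic percentile are assumptions made for illustration.

```python
import random
import statistics

def p5_of_bootstrap(trade_pnls, n_sims, rng):
    """Order-statistic estimate of the 5th percentile of bootstrapped total PnL."""
    n = len(trade_pnls)
    totals = sorted(sum(rng.choices(trade_pnls, k=n)) for _ in range(n_sims))
    return totals[int(0.05 * (n_sims - 1))]

def p5_spread(trade_pnls, n_sims, repeats=20, seed=0):
    """Standard deviation of the p5 estimate across independent MC runs."""
    rng = random.Random(seed)
    estimates = [p5_of_bootstrap(trade_pnls, n_sims, rng) for _ in range(repeats)]
    return statistics.stdev(estimates)

# Synthetic trade sample (hypothetical): 50 trades with a small positive mean.
gen = random.Random(42)
trades = [gen.gauss(1.0, 10.0) for _ in range(50)]

spread_small = p5_spread(trades, n_sims=50)
spread_large = p5_spread(trades, n_sims=1000)
print(f"p5 spread at N=50: {spread_small:.2f}, at N=1000: {spread_large:.2f}")
```

The spread shrinks as N grows, but every run resamples the same 50 trades, so uncertainty about the trade sample itself is unchanged.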
Quick MC glossary
| Term | Role | Link |
|---|---|---|
| Bootstrap | Resampling trades with replacement | /facts/monte-carlo |
| p5 / p95 | Simulated PnL distribution tails | monte-carlo.json |
| Robustness score | Stability under perturbations | /live-results |
| Walk-forward | Chronological train/test split that prevents look-ahead bias | /backtest |
| Drift | Backtest vs paper gap | /dashboard |
Explicit limits
- Crypto / perps: fat tails — classical bootstrap may understate extremes.
- Regime shifts: a historical MC pass does not guarantee the future regime.
- Educational content; no profit promise.