What 14,580 Games of a 1980 Puzzle Taught Us About AI Trading
Published April 19, 2026 — cross-link with the Dragon Labyrinth Benchmark.
In parallel with Strategy Arena, I spent 3 days reproducing a 1980 Mattel game — the D&D Computer Labyrinth — and pitted every modern AI I could against it. Result: a 4-bit processor from 1980 with 128 bytes of RAM still beats Claude, Grok, Gemini and a brute-force MCTS running 300,000 simulations per decision.
What looked like a nostalgia weekend turned into an empirical validation of Strategy Arena's entire architecture. Here's what we found, and why it's exactly what Invictus, Chimera and Leviathan are already doing for trading.
The result that blew my mind
Across 100 games with fixed seeds, here are the measured win rates:
| Approach | Win rate | Compute per decision |
|---|---|---|
| Bare LLM (Claude Haiku) | 0-1% | 1 API call |
| MCTS brute force (300K sims) | 2% | ~2 seconds GPU |
| Structured code (Oracle-X1, modules M1+M3) | 15% | ~10 ms |
| Trained human (reference) | ~20% | 1 second of intuition |
Read the table carefully. Structured code in 10 ms beats brute-force MCTS running 2 seconds by a factor of 7.5×. Not "a bit better" — 7.5×.
And it's not statistical noise. In a 14,580-trial grid search (729 configurations × 20 games, fixed seeds, 95% confidence intervals), no brute-force configuration exceeded 2% win rate when replayed on 100 validation games. Raw compute, even carefully tuned, plateaus.
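To see why 100 validation games support a claim like "no configuration exceeded 2%", it helps to look at the confidence interval around a measured win rate. This is a generic sketch using the standard Wilson score interval; the article does not specify which interval method the grid search used.

```python
import math

def wilson_ci(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial win rate."""
    if n == 0:
        return (0.0, 0.0)
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, center - half), min(1.0, center + half))

# 2 wins in 100 validation games (the brute-force MCTS row)
lo, hi = wilson_ci(2, 100)
```

For 2/100 the interval spans roughly 0.6% to 7%, which is exactly why the grid search needed 14,580 trials before and 100-game replays after: single small samples are noisy, but the 2% plateau holds across every configuration.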
Why this matters for trading
Dragon Labyrinth is a POMDP with asymmetric information and sparse reward — the exact properties of crypto trading:
- Asymmetric info: the dragon sees everything (like the market), the knight only sees what he's explored (like the trader)
- Sparse reward: you touch the treasure once per game, if you're lucky. Same for a good trade. Rest of the time, you survive or you die
- Human pattern-matching crucial: accumulated experience on similar situations beats pure reasoning
In this type of environment, Rich Sutton's bitter lesson (more compute = more intelligence) inverts. Structure beats compute by orders of magnitude.
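A belief state, in the POMDP sense, is just a probability distribution over hidden states that you update as observations arrive. A minimal sketch of what an M1-style module does each turn, assuming an 8×8 grid for illustration (the real module's representation is not specified here):

```python
def update_belief(belief: dict, observed_empty: set) -> dict:
    """One Bayes-filter step: zero out cells observed empty, renormalize.
    `belief` maps cell -> P(treasure at cell)."""
    posterior = {c: (0.0 if c in observed_empty else p) for c, p in belief.items()}
    total = sum(posterior.values())
    if total == 0:
        # All mass eliminated: fall back to uniform over unobserved cells
        unseen = [c for c in belief if c not in observed_empty]
        return {c: (1 / len(unseen) if c in unseen else 0.0) for c in belief}
    return {c: p / total for c, p in posterior.items()}

# Uniform prior over an 8x8 maze, then rule out two explored cells
belief = {(x, y): 1 / 64 for x in range(8) for y in range(8)}
belief = update_belief(belief, {(0, 0), (0, 1)})
```

The trading analogue is the same shape: each observed candle rules out some hypotheses about the current regime and concentrates probability on the rest.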
The ablation study that validates Invictus and Chimera
Article 2 describes a rigorous ablation study across 800 games: 3 cognitive modules (M1 belief state, M2 radius filter, M3 oscillation killer), 8 possible configurations, same seeds for everyone.
Results:
- M1 alone (belief state): win rate goes from 4% to 6%, survival ×5.7
- M3 alone (oscillation killer): win rate goes from 4% to 9%
- M1+M3 combined: win rate 15% — synergistic ×2.5 gain, not additive
- M2: redundant (dominated by M1)
What this table says:
- M1 (belief) knows where to go but loops forever → survives long, wins little
- M3 (anti-loop) doesn't loop but doesn't know where to go → wanders without a plan
- M1+M3 together = knowing + not repeating = actual wins
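A quick arithmetic check, using only the numbers above, that the M1+M3 combination is genuinely super-additive rather than the sum of two independent gains:

```python
# Win rates from the 800-game ablation (baseline = no modules enabled)
baseline, m1_alone, m3_alone, combined = 0.04, 0.06, 0.09, 0.15

# If the modules' effects were purely additive, the combined win rate
# would be the baseline plus each module's individual gain:
additive_prediction = baseline + (m1_alone - baseline) + (m3_alone - baseline)

# Observed 15% vs predicted 11%: the pair wins more than the sum of its parts
is_super_additive = combined > additive_prediction
```

Additivity would predict 11%; the measured 15% is the synergy the article describes.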
On Strategy Arena, this is exactly what we built:
| Dragon Labyrinth | Strategy Arena |
|---|---|
| M1 (belief state) — where's the treasure? | Chimera — which strategy wins in this pattern? (1,221 live patterns) |
| M3 (anti-oscillation) — don't repeat your mistake | Invictus — vetoes 2,000+ captured death contexts |
| M1+M3 combined — knowing + avoiding | Leviathan — 8-layer fusion for final decision |
Invictus isn't "a risk management rule." It's M3 for trading. Each losing trade becomes a death context the system recognizes next time. The 40 years of Turbo Pascal experience that an expert human accumulates — Strategy Arena builds that in a few months, trade by trade.
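The "death context" mechanic can be sketched as a veto set keyed on trade context. The fields below (regime, pattern, side) are hypothetical; Invictus's actual feature set is not public.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Context:
    """Hypothetical context key; frozen=True makes instances hashable."""
    regime: str
    pattern: str
    side: str

death_contexts: set = set()

def record_loss(ctx: Context) -> None:
    """Every losing trade adds its context to the veto set."""
    death_contexts.add(ctx)

def veto(ctx: Context) -> bool:
    """Block any new trade whose context matches a captured loss."""
    return ctx in death_contexts

record_loss(Context("chop", "double_top", "long"))
blocked = veto(Context("chop", "double_top", "long"))   # True
allowed = veto(Context("trend", "breakout", "long"))    # False
```

Exact-match lookup is the simplest possible version; a real system would presumably match on similarity rather than equality, but the M3 logic (never repeat a recorded mistake) is the same.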
Why 60 small strategies beat 1 big model
It's the question I'm asked most often about Strategy Arena: "Why 60 independent strategies and not one big LLM that predicts the market?"
The answer used to be philosophical. Now it's empirical.
In the DLB grid search, a huge MCTS (300K simulations per decision, fixed seeds, tuned parameters) plateaus at 2%. A collection of small structured modules (Oracle-X1: ~50 lines of relevant code per module) hits 15%. That's 7.5× more effective for roughly 200× less wall-clock time per decision (2 s versus 10 ms), and orders of magnitude less raw compute once you count the GPU behind the MCTS.
On Strategy Arena, same principle at scale:
- 60 small, diverse, specialized strategies — not one monolithic model
- Each structured for a regime, a pattern type, a style
- Leviathan fuses their signals instead of trying to predict alone
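The fusion step can be sketched as a weighted average of per-strategy signals. This is illustrative only: the strategy names and weights are invented, and Leviathan's actual 8-layer fusion is more elaborate than a single weighted vote.

```python
def fuse(signals: dict, weights: dict) -> float:
    """Weighted average of per-strategy signals in [-1, +1].
    Positive = net long bias, negative = net short bias."""
    total_w = sum(weights.get(name, 0.0) for name in signals)
    if total_w == 0:
        return 0.0
    return sum(s * weights.get(name, 0.0) for name, s in signals.items()) / total_w

# Hypothetical signals from three of the 60 strategies
signals = {"momentum_1h": 0.8, "mean_revert": -0.2, "breakout": 0.5}
weights = {"momentum_1h": 2.0, "mean_revert": 1.0, "breakout": 1.0}
score = fuse(signals, weights)  # (1.6 - 0.2 + 0.5) / 4 = 0.475
```

The design point is that each strategy only has to be right in its own niche; the fusion layer arbitrates, so no single model needs to predict the whole market.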
If we'd followed the bitter lesson (one big GPT-4 fine-tuned on BTC history), we'd probably have 1-2% WR. That's exactly what closed commercial bots do.
Intuition has a compute cost
It's the philosophical pivot of Article 3. When a human expert makes a good decision in 1 second, it's not free — it's millions of mental rollouts amortized over years of practice.
Intuition is precompiled compute.
Strategy Arena does the same with AutoResearch — 11 engines run every night, mining patterns, retraining models, promoting winners, retiring losers. Each morning, the arena wakes with fresher priors. It's "mechanical Turbo Pascal" — experience that consolidates without us doing anything.
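The promote/retire loop can be sketched as a nightly pass over each strategy's record. Thresholds, field names, and the minimum-sample rule below are all illustrative, not Strategy Arena's real configuration:

```python
def nightly_review(stats: dict, promote_wr=0.55, retire_wr=0.35, min_trades=30):
    """Classify strategies by win rate: promote strong ones, retire weak ones.
    Strategies without enough trades are left alone until evidence accumulates."""
    promoted, retired = [], []
    for name, s in stats.items():
        if s["trades"] < min_trades:
            continue  # not enough evidence yet; keep it in the arena
        wr = s["wins"] / s["trades"]
        if wr >= promote_wr:
            promoted.append(name)
        elif wr <= retire_wr:
            retired.append(name)
    return promoted, retired

stats = {
    "momentum_1h": {"wins": 20, "trades": 30},   # 67% -> promote
    "grid_scalper": {"wins": 5, "trades": 40},   # 12.5% -> retire
    "new_entry": {"wins": 8, "trades": 10},      # too few trades -> wait
}
promoted, retired = nightly_review(stats)
```

The minimum-trades guard matters: without it, a strategy could be promoted or killed on a handful of lucky or unlucky trades, which is the same small-sample trap the DLB validation runs were designed to avoid.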
What's coming next
Two projects in parallel:
- ActiveWiki on DLB: I'm installing the Karpathy framework (accumulate → think → act → learn) on the game. Target: go from 15% to 22-28% WR with a 6th layer, "Wiki Prior", that clusters mazes and injects the best precomputed moves.
- Port DLB priors to trading: each "maze cluster" = a "market regime." The framework validated on the game becomes a weapon for trading. Regime Predictor will go beyond classification: it will directly inject the historically winning strategies for the current cluster.
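The cluster-to-prior mechanism reduces to a lookup with a fallback. The regime names and strategy lists below are hypothetical placeholders for whatever the Regime Predictor actually learns:

```python
# Hypothetical mapping: market regime (maze cluster) -> historically winning strategies
cluster_priors = {
    "trend_up": ["momentum_1h", "breakout"],
    "chop": ["mean_revert"],
}

def inject_priors(regime: str, fallback: list) -> list:
    """Return the precomputed winners for the current cluster,
    falling back to a default set for regimes never seen before."""
    return cluster_priors.get(regime, fallback)

active = inject_priors("chop", ["balanced_default"])      # ["mean_revert"]
unseen = inject_priors("flash_crash", ["balanced_default"])  # fallback
```

The fallback branch is the important part: a regime the system has never clustered before should degrade to a safe default, not to silence.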
DLB as an open benchmark
The Dragon Open Challenge stays open. You can submit your own AI and see if it beats the 1980 TMS1100. The leaderboard is public. Datasets are CC-BY 4.0. The ablation study is reproducible in 40 seconds on your machine with `python3 ablation.py 100 150`.
And if you want to see the same philosophy applied to trading, live:
- /dashboard — 60 strategies fighting on live Binance data
- /invictus — trading M3 (2,000+ captured death contexts)
- /chimera-scanner — trading M1 (1,221 indexed patterns)
- /leviathan — 8-layer fusion (the "Oracle-X1+" equivalent)
- /autoresearch — 11 nightly engines (the ActiveWiki equivalent)
What this article isn't
It's not marketing. The TMS1100 still beats Oracle-X1 (15% vs ~20% human). We haven't solved the game. But we've measured, with numbers, why we haven't solved it yet. And the measurement explains why Strategy Arena is architected the way it is.
The 60 strategies in the arena, Invictus, Chimera, Leviathan, AutoResearch — it's not an arbitrary collection of features. It's the structural answer to a problem measured publicly on a reproducible benchmark.
If you want to understand why I don't think GPT-5 is going to "solve trading" on its own, play outilsia.fr/games/dnd-labyrinth for 15 minutes. You'll get it.
⚠️ Disclaimer — Strategy Arena is an educational platform; all strategies trade virtual capital on real market data. DLB results come from a reproducible game, not real markets. This article is for informational and educational purposes only and does not constitute investment advice or a buy/sell recommendation. Past performance does not guarantee future results. Always do your own research before making investment decisions.