What 14,580 Games of a 1980 Puzzle Taught Us About AI Trading
Published April 19, 2026 — cross-link with the Dragon Labyrinth Benchmark.
In parallel with Strategy Arena, I spent 3 days reproducing a 1980 Mattel game — the D&D Computer Labyrinth — and pitted every modern AI I could against it. Result: a 4-bit processor from 1980 with 128 bytes of RAM still beats Claude, Grok, Gemini and a brute-force MCTS running 300,000 simulations per decision.
What looked like a nostalgia weekend turned into an empirical validation of Strategy Arena's entire architecture. Here's what we found, and why it's exactly what Invictus, Chimera and Leviathan are already doing for trading.
The result that blew my mind
Across 100 games with fixed seeds, here are the measured win rates:
| Approach | Win rate | Compute per decision |
|---|---|---|
| Bare LLM (Claude Haiku) | 0-1% | 1 API call |
| MCTS brute force (300K sims) | 2% | ~2 seconds GPU |
| Structured code (Oracle-X1, modules M1+M3) | 15% | ~10 ms |
| Trained human (reference) | ~20% | 1 second of intuition |
Read the table carefully. Structured code in 10 ms beats brute-force MCTS running 2 seconds by a factor of 7.5×. Not "a bit better" — 7.5×.
And it's not statistical noise. In a 14,580-trial grid search (729 configurations × 20 games, fixed seeds, 95% confidence intervals), no brute-force configuration exceeded 2% win rate when replayed on 100 validation games. Raw compute, even carefully tuned, plateaus.
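To see why 100 validation games support a claim like "no configuration exceeded 2%", it helps to look at the confidence interval around a measured win rate. This is a generic sketch using the standard Wilson score interval; the article does not specify which interval method the grid search used.

```python
import math

def wilson_ci(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial win rate."""
    if n == 0:
        return (0.0, 0.0)
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, center - half), min(1.0, center + half))

# 2 wins in 100 validation games (the brute-force MCTS row)
lo, hi = wilson_ci(2, 100)
```

For 2/100 the interval spans roughly 0.6% to 7%, which is exactly why the grid search needed 14,580 trials before and 100-game replays after: single small samples are noisy, but the 2% plateau holds across every configuration.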
Why this matters for trading
Dragon Labyrinth is a POMDP with asymmetric information and sparse reward — the exact properties of crypto trading:
- Asymmetric info: the dragon sees everything (like the market), the knight only sees what he's explored (like the trader)
- Sparse reward: you touch the treasure once per game, if you're lucky. Same for a good trade. Rest of the time, you survive or you die
- Human pattern-matching crucial: accumulated experience on similar situations beats pure reasoning
In this type of environment, Rich Sutton's bitter lesson (more compute = more intelligence) inverts. Structure beats compute by orders of magnitude.
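A belief state, in the POMDP sense, is just a probability distribution over hidden states that you update as observations arrive. A minimal sketch of what an M1-style module does each turn, assuming an 8×8 grid for illustration (the real module's representation is not specified here):

```python
def update_belief(belief: dict, observed_empty: set) -> dict:
    """One Bayes-filter step: zero out cells observed empty, renormalize.
    `belief` maps cell -> P(treasure at cell)."""
    posterior = {c: (0.0 if c in observed_empty else p) for c, p in belief.items()}
    total = sum(posterior.values())
    if total == 0:
        # All mass eliminated: fall back to uniform over unobserved cells
        unseen = [c for c in belief if c not in observed_empty]
        return {c: (1 / len(unseen) if c in unseen else 0.0) for c in belief}
    return {c: p / total for c, p in posterior.items()}

# Uniform prior over an 8x8 maze, then rule out two explored cells
belief = {(x, y): 1 / 64 for x in range(8) for y in range(8)}
belief = update_belief(belief, {(0, 0), (0, 1)})
```

The trading analogue is the same shape: each observed candle rules out some hypotheses about the current regime and concentrates probability on the rest.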
The ablation study that validates Invictus and Chimera
Article 2 describes a rigorous ablation study across 800 games: 3 cognitive modules (M1 belief state, M2 radius filter, M3 oscillation killer), 8 possible configurations, same seeds for everyone.
Results:
- M1 alone (belief state): win rate goes from 4% to 6%, survival ×5.7
- M3 alone (oscillation killer): win rate goes from 4% to 9%
- M1+M3 combined: win rate 15% — synergistic ×2.5 gain, not additive
- M2: redundant (dominated by M1)
What this table says:
- M1 (belief) knows where to go but loops forever → survives long, wins little
- M3 (anti-loop) doesn't loop but doesn't know where to go → wanders without a plan
- M1+M3 together = knowing + not repeating = actual wins
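A quick arithmetic check, using only the numbers above, that the M1+M3 combination is genuinely super-additive rather than the sum of two independent gains:

```python
# Win rates from the 800-game ablation (baseline = no modules enabled)
baseline, m1_alone, m3_alone, combined = 0.04, 0.06, 0.09, 0.15

# If the modules' effects were purely additive, the combined win rate
# would be the baseline plus each module's individual gain:
additive_prediction = baseline + (m1_alone - baseline) + (m3_alone - baseline)

# Observed 15% vs predicted 11%: the pair wins more than the sum of its parts
is_super_additive = combined > additive_prediction
```

Additivity would predict 11%; the measured 15% is the synergy the article describes.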
On Strategy Arena, this is exactly what we built:
| Dragon Labyrinth | Strategy Arena |
|---|---|
| M1 (belief state) — where's the treasure? | Chimera — which strategy wins in this pattern? (1,221 live patterns) |
| M3 (anti-oscillation) — don't repeat your mistake | Invictus — vetoes 2,000+ captured death contexts |
| M1+M3 combined — knowing + avoiding | Leviathan — 8-layer fusion for final decision |
Invictus isn't "a risk management rule." It's M3 for trading. Each losing trade becomes a death context the system recognizes next time. The 40 years of Turbo Pascal experience that an expert human accumulates — Strategy Arena builds that in a few months, trade by trade.
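The "death context" mechanic can be sketched as a veto set keyed on trade context. The fields below (regime, pattern, side) are hypothetical; Invictus's actual feature set is not public.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Context:
    """Hypothetical context key; frozen=True makes instances hashable."""
    regime: str
    pattern: str
    side: str

death_contexts: set = set()

def record_loss(ctx: Context) -> None:
    """Every losing trade adds its context to the veto set."""
    death_contexts.add(ctx)

def veto(ctx: Context) -> bool:
    """Block any new trade whose context matches a captured loss."""
    return ctx in death_contexts

record_loss(Context("chop", "double_top", "long"))
blocked = veto(Context("chop", "double_top", "long"))   # True
allowed = veto(Context("trend", "breakout", "long"))    # False
```

Exact-match lookup is the simplest possible version; a real system would presumably match on similarity rather than equality, but the M3 logic (never repeat a recorded mistake) is the same.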
Why 60 small strategies beat 1 big model
It's the question I'm asked most often about Strategy Arena: "Why 60 independent strategies and not one big LLM that predicts the market?"
The answer used to be philosophical. Now it's empirical.
In the DLB grid search, a huge MCTS (300K simulations per decision, fixed seeds, tuned parameters) plateaus at 2%. A collection of small structured modules (Oracle-X1: ~50 lines of relevant code per module) hits 15%. That's 7.5× more effective for roughly 200× less wall-clock time per decision (2 s versus 10 ms), and orders of magnitude less raw compute once you count the GPU behind the MCTS.
On Strategy Arena, same principle at scale:
- 60 small, diverse, specialized strategies — not one monolithic model
- Each structured for a regime, a pattern type, a style
- Leviathan fuses their signals instead of trying to predict alone
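The fusion step can be sketched as a weighted average of per-strategy signals. This is illustrative only: the strategy names and weights are invented, and Leviathan's actual 8-layer fusion is more elaborate than a single weighted vote.

```python
def fuse(signals: dict, weights: dict) -> float:
    """Weighted average of per-strategy signals in [-1, +1].
    Positive = net long bias, negative = net short bias."""
    total_w = sum(weights.get(name, 0.0) for name in signals)
    if total_w == 0:
        return 0.0
    return sum(s * weights.get(name, 0.0) for name, s in signals.items()) / total_w

# Hypothetical signals from three of the 60 strategies
signals = {"momentum_1h": 0.8, "mean_revert": -0.2, "breakout": 0.5}
weights = {"momentum_1h": 2.0, "mean_revert": 1.0, "breakout": 1.0}
score = fuse(signals, weights)  # (1.6 - 0.2 + 0.5) / 4 = 0.475
```

The design point is that each strategy only has to be right in its own niche; the fusion layer arbitrates, so no single model needs to predict the whole market.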
If we'd followed the bitter lesson (one big GPT-4 fine-tuned on BTC history), we'd probably have 1-2% WR. That's exactly what closed commercial bots do.
Intuition has a compute cost
It's the philosophical pivot of Article 3. When a human expert makes a good decision in 1 second, it's not free — it's millions of mental rollouts amortized over years of practice.
Intuition is precompiled compute.
Strategy Arena does the same with AutoResearch — 11 engines run every night, mining patterns, retraining models, promoting winners, retiring losers. Each morning, the arena wakes with fresher priors. It's "mechanical Turbo Pascal" — experience that consolidates without us doing anything.
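The promote/retire loop can be sketched as a nightly pass over each strategy's record. Thresholds, field names, and the minimum-sample rule below are all illustrative, not Strategy Arena's real configuration:

```python
def nightly_review(stats: dict, promote_wr=0.55, retire_wr=0.35, min_trades=30):
    """Classify strategies by win rate: promote strong ones, retire weak ones.
    Strategies without enough trades are left alone until evidence accumulates."""
    promoted, retired = [], []
    for name, s in stats.items():
        if s["trades"] < min_trades:
            continue  # not enough evidence yet; keep it in the arena
        wr = s["wins"] / s["trades"]
        if wr >= promote_wr:
            promoted.append(name)
        elif wr <= retire_wr:
            retired.append(name)
    return promoted, retired

stats = {
    "momentum_1h": {"wins": 20, "trades": 30},   # 67% -> promote
    "grid_scalper": {"wins": 5, "trades": 40},   # 12.5% -> retire
    "new_entry": {"wins": 8, "trades": 10},      # too few trades -> wait
}
promoted, retired = nightly_review(stats)
```

The minimum-trades guard matters: without it, a strategy could be promoted or killed on a handful of lucky or unlucky trades, which is the same small-sample trap the DLB validation runs were designed to avoid.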
What's coming next
Two projects in parallel:
- ActiveWiki on DLB: I'm installing the Karpathy framework (accumulate → think → act → learn) on the game. Target: go from 15% to 22-28% WR with a 6th layer, "Wiki Prior", that clusters mazes and injects the best precomputed moves.
- Port DLB priors to trading: each "maze cluster" = a "market regime." The framework validated on the game becomes a weapon for trading. Regime Predictor will go beyond classification: it will directly inject the historically winning strategies for the current cluster.
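The cluster-to-prior mechanism reduces to a lookup with a fallback. The regime names and strategy lists below are hypothetical placeholders for whatever the Regime Predictor actually learns:

```python
# Hypothetical mapping: market regime (maze cluster) -> historically winning strategies
cluster_priors = {
    "trend_up": ["momentum_1h", "breakout"],
    "chop": ["mean_revert"],
}

def inject_priors(regime: str, fallback: list) -> list:
    """Return the precomputed winners for the current cluster,
    falling back to a default set for regimes never seen before."""
    return cluster_priors.get(regime, fallback)

active = inject_priors("chop", ["balanced_default"])      # ["mean_revert"]
unseen = inject_priors("flash_crash", ["balanced_default"])  # fallback
```

The fallback branch is the important part: a regime the system has never clustered before should degrade to a safe default, not to silence.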
DLB as an open benchmark
The Dragon Open Challenge stays open. You can submit your own AI and see if it beats the 1980 TMS1100. The leaderboard is public. Datasets are CC-BY 4.0. The ablation study is reproducible in 40 seconds on your machine with `python3 ablation.py 100 150`.
And if you want to see the same philosophy applied to trading, live:
- /dashboard — 60 strategies fighting on live Binance data
- /invictus — trading M3 (2,000+ captured death contexts)
- /chimera-scanner — trading M1 (1,221 indexed patterns)
- /leviathan — 8-layer fusion (the "Oracle-X1+" equivalent)
- /autoresearch — 11 nightly engines (the ActiveWiki equivalent)
What this article isn't
It's not marketing. The TMS1100 still beats Oracle-X1 (15% vs ~20% human). We haven't solved the game. But we've measured, with numbers, why we haven't solved it yet. And the measurement explains why Strategy Arena is architected the way it is.
The 60 strategies in the arena, Invictus, Chimera, Leviathan, AutoResearch — it's not an arbitrary collection of features. It's the structural answer to a problem measured publicly on a reproducible benchmark.
If you want to understand why I don't think GPT-5 is going to "solve trading" on its own, play outilsia.fr/games/dnd-labyrinth for 15 minutes. You'll get it.
⚠️ Disclaimer — Strategy Arena is an educational platform; all strategies trade virtual capital on real market data. DLB results come from a reproducible game, not real markets. This article is for informational and educational purposes only and does not constitute investment advice or a buy/sell recommendation. Past performance does not guarantee future results. Always do your own research before making investment decisions.