Dragon Labyrinth Benchmark — Structure vs Compute

📅 v1 · 2026 🎲 14,580 trials 🔬 800 ablation games 🏷 CC-BY 4.0 💾 JSON dataset

For 45 years we have confused information-cheating with intelligence. When the cheating is removed, structure beats brute-force compute by 7.5× in win rate. This page publishes the evidence, the method and the raw data, all reproducible and licensed CC-BY.

Key figure	Meaning
7.5×	Advantage of structure over brute-force MCTS
2.5×	M1+M3 non-additive synergy
300K	MCTS simulations per decision, where win rate plateaus
85%	TMS1100 win rate (cheating)
1–2%	Bare LLM / brute-force MCTS win rate
15%	Oracle-X1 (M1+M3) win rate, code only

The experiment in one paragraph

In 1980 Mattel shipped a handheld called Dragon Labyrinth Game. It ran on a 4-bit TMS1100 with 64 bytes of ROM and 16 instructions, and it drove a dragon that chased a player through a maze. The dragon won 85% of matches against humans, not because it was smart but because it had the full game state (all cells), while the player had only line-of-sight vision. We reproduced the game faithfully in 2026 and removed the cheat: every agent gets the same partial observability. Then we tested 5 categories of AI across 14,580 trials.
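The difference between the cheating dragon and a fair agent can be illustrated with a minimal sketch. It assumes a grid maze where `#` marks a wall cell and vision runs along the four cardinal directions until blocked; the names `line_of_sight`, `observe` and `MAZE` are hypothetical and are not taken from the benchmark code.

```python
from typing import List, Optional, Set, Tuple

Cell = Tuple[int, int]  # (row, col)

# Hypothetical 3x5 maze: '.' is open floor, '#' is a wall cell.
MAZE = [
    ".....",
    ".###.",
    ".....",
]


def line_of_sight(grid: List[str], pos: Cell) -> Set[Cell]:
    """Cells visible from pos: walk each cardinal direction until a wall or the edge."""
    rows, cols = len(grid), len(grid[0])
    visible = {pos}
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = pos[0] + dr, pos[1] + dc
        while 0 <= r < rows and 0 <= c < cols and grid[r][c] != "#":
            visible.add((r, c))
            r, c = r + dr, c + dc
    return visible


def observe(grid: List[str], observer: Cell, other: Cell,
            cheat: bool = False) -> Optional[Cell]:
    """Return the other agent's position if the observer may know it, else None.

    cheat=True models full game state access; cheat=False models the
    shared partial observability described above.
    """
    if cheat or other in line_of_sight(grid, observer):
        return other
    return None
```

With the observer at `(0, 0)` and the other agent at `(2, 4)`, `observe(MAZE, (0, 0), (2, 4))` yields nothing because the wall row blocks the view, while the same call with `cheat=True` returns the exact position.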

Results — ranked win rates

Approach	Win rate	Notes
🎰 TMS1100 (1980, cheating)	85%	Full game state access
👤 Trained human	20%	20/80 cohort reference
🧠 Oracle-X1 (M1+M3 code)	15%	Best code-only result · 7.5× MCTS
🔬 MCTS 300K sims/decision	2%	Plateaus — compute alone not enough
🤖 Bare LLM (Claude/Grok/GPT/Gemini)	1%	Spatial blindness, no world model
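The 7.5× figure follows directly from the table: it is the Oracle-X1 win rate divided by the MCTS win rate. The short check below uses only the percentages listed above; the dictionary keys are shortened labels chosen for illustration.

```python
# Win rates in percent, copied from the results table.
win_rate_pct = {
    "TMS1100 (cheating)": 85,
    "Trained human": 20,
    "Oracle-X1 (M1+M3)": 15,
    "MCTS 300K": 2,
    "Bare LLM": 1,
}

# Express every approach as a multiple of the brute-force MCTS baseline.
baseline = win_rate_pct["MCTS 300K"]
ratio_vs_mcts = {name: pct / baseline for name, pct in win_rate_pct.items()}

for name, ratio in ratio_vs_mcts.items():
    print(f"{name}: {ratio:.1f}x MCTS")
```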

Ablation study — 800 games, fixed seeds

We isolated 4 cognitive modules and tested each alone, then in pairs, reporting win rates (WR) at a 95% confidence level.

Module	Role	Win rate
Belief state (M1)	Where is the target?	6% solo
Radius filter	Dominated by M1, redundant	4% solo
Oscillation killer (M3)	Anti-repeat behavior	9% solo
M1+M3 combined	Non-additive, synergy 2.5×	15% combined
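For readers who want to attach a 95% interval to one of these win rates, a Wilson score interval is one common choice. This is a sketch, not the benchmark's stated method: the page does not say which interval was used, and the number of games per configuration is not given, so the `games=200` in the usage line is a hypothetical value.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, games: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a win rate; z=1.96 corresponds to ~95% confidence."""
    if games <= 0:
        raise ValueError("games must be positive")
    if not 0 <= wins <= games:
        raise ValueError("wins must be between 0 and games")
    p = wins / games
    z2 = z * z
    denom = 1 + z2 / games
    centre = (p + z2 / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z2 / (4 * games * games)) / denom
    return centre - half, centre + half


# Hypothetical example: 30 wins in 200 games, i.e. a 15% observed win rate.
low, high = wilson_interval(wins=30, games=200)
print(f"15% observed, 95% interval: {low:.1%} to {high:.1%}")
```

The interval narrows as the number of games grows, which is why the per-configuration game count matters when comparing modules whose win rates differ by only a few points.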

M1 alone wins 6% of games and M3 alone 9%; combined they reach 15%, reported here as a non-additive synergy of 2.5×. M1 says where to look. M3 says how not to loop. Together they make a decision architecture, not just a heuristic pile.
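How the two modules might cooperate can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the Oracle-X1 source: `update_belief` stands in for M1 (a set of cells where the target could be), `choose_move` stands in for M3 (a penalty on recently visited cells), and the Manhattan distance, the `penalty` value and the memory length are all hypothetical choices.

```python
from collections import deque
from typing import Deque, Iterable, Optional, Set, Tuple

Cell = Tuple[int, int]  # (row, col)


def manhattan(a: Cell, b: Cell) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def update_belief(belief: Set[Cell], visible: Set[Cell],
                  seen: Optional[Cell]) -> Set[Cell]:
    """M1-style update: collapse on a sighting, otherwise rule out visible empty cells."""
    if seen is not None:
        return {seen}
    remaining = belief - visible
    # Assumption: keep the old belief rather than let it become empty.
    return remaining if remaining else set(belief)


def choose_move(moves: Iterable[Cell], belief: Set[Cell],
                recent: Deque[Cell], penalty: float = 5.0) -> Cell:
    """Pick the move nearest the belief set, penalising recently visited cells (M3-style)."""
    def cost(cell: Cell) -> float:
        nearest = min(manhattan(cell, b) for b in belief)
        return nearest + (penalty if cell in recent else 0.0)

    best = min(moves, key=cost)
    recent.append(best)  # remember the step so an immediate return costs more
    return best


# Usage: a short memory of the last 4 cells is an arbitrary illustrative choice.
recent: Deque[Cell] = deque(maxlen=4)
belief = update_belief({(0, 2), (3, 3)}, visible={(3, 3)}, seen=None)
step = choose_move([(0, 1), (1, 0)], belief, recent)
```

In this sketch neither function is useful alone: the belief set gives the cost function something to aim at, and the recency penalty stops the agent from shuttling between two cells that are equally close to that belief.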

Why it matters for AI trading

If structure beats compute on a well-defined POMDP like maze pursuit, we argue the same thesis applies to crypto markets, which share its partial observability, noisy signals and adversarial agents. Our arena at /bot-arena applies Oracle-X1's decomposition principle: each strategy is a stack of rules, not a monolithic model. Chimera (50 patterns), Invictus (2,000+ death contexts) and Leviathan (8 cognitive layers) are all structural decompositions.

Reproduce or extend

Play the game live — outilsia.fr/games/dnd-labyrinth
Public leaderboard — outilsia.fr/dnd-challenge
JSON dataset — /api/data/dlb-summary
Full scientific context — /scientific-foundation


Strategy Arena — structure over compute, applied to live crypto.