Every architectural choice in Strategy Arena — 60 strategies instead of 1 big model, Invictus instead of a price predictor, Leviathan instead of a meta-LLM — traces back to a measured result on a reproducible benchmark. Here is the evidence.
In April 2026 we reproduced the Mattel D&D Computer Labyrinth (1980) — an 8×8 board with an invisible dragon driven by a 4-bit TMS1100 processor — and pitted every modern AI against it. The results, published in 3 papers on outilsia.fr, are reproducible in 40 seconds on any laptop.
This benchmark has the same structural properties as crypto trading: POMDP dynamics (partial observability), sparse reward, asymmetric information, and a decisive role for human pattern-matching. What works on Dragon Labyrinth works on live trading. What fails on Dragon Labyrinth fails on live trading.
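To make the structural analogy concrete, here is a minimal POMDP sketch of such an environment. The class name, observation format, static dragon, and reward values are our illustrative assumptions, not the published benchmark code:

```python
import random

class DragonLabyrinth:
    """Illustrative POMDP sketch: 8x8 grid, hidden treasure, hidden dragon.
    Simplified reconstruction; the real game also moves the dragon."""

    def __init__(self, seed=0):
        rng = random.Random(seed)            # fixed seed => reproducible game
        self.player = (0, 0)
        self.treasure = (rng.randrange(8), rng.randrange(8))
        self.dragon = (rng.randrange(8), rng.randrange(8))

    def observe(self):
        # Partial observability: the agent never sees the dragon's square,
        # only a warning when it is adjacent (Chebyshev distance <= 1).
        px, py = self.player
        dx, dy = self.dragon
        return {"position": self.player,
                "dragon_near": max(abs(px - dx), abs(py - dy)) <= 1}

    def step(self, move):
        # Sparse reward: 0 every step, +1 on the treasure, -1 on the dragon.
        dx, dy = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}[move]
        x = min(7, max(0, self.player[0] + dx))
        y = min(7, max(0, self.player[1] + dy))
        self.player = (x, y)
        if self.player == self.dragon:
            return self.observe(), -1, True
        if self.player == self.treasure:
            return self.observe(), +1, True
        return self.observe(), 0, False
```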
| Approach | Win rate | Compute per decision |
|---|---|---|
| Bare LLM (Claude Haiku) | 0–1% | 1 API call |
| MCTS brute force (300K sims) | 2% | ~2 s GPU |
| Structured code (Oracle-X1, M1+M3) | 15% | ~10 ms |
| Trained human (reference) | ~20% | ~1 s of intuition |
Structured code at ~10 ms per decision beats brute-force MCTS at ~2 s per decision by a factor of 7.5 (15% vs 2% win rate). The grid search across 14,580 trials confirms it: no brute-force configuration exceeds 2% when replayed. Compute alone plateaus. Structure does not.
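The replay protocol behind those numbers is simple to sketch. The harness below is our reconstruction, assuming the DragonLabyrinth sketch above and agents exposing an act(observation) method:

```python
def win_rate(agent_factory, seeds, max_steps=200):
    """Replay one agent configuration over fixed seeds; return win fraction.
    Sketch of the fixed-seed protocol, not the published harness."""
    wins = 0
    for seed in seeds:
        env = DragonLabyrinth(seed)          # same seed => same board for every agent
        agent = agent_factory()
        obs, reward = env.observe(), 0
        for _ in range(max_steps):           # step cap so looping agents terminate
            obs, reward, done = env.step(agent.act(obs))
            if done:
                break
        wins += reward > 0                   # win only on treasure, not on timeout
    return wins / len(seeds)
```

Fixing the seed list is what makes 2% vs 15% a like-for-like comparison rather than sampling noise.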
A rigorous ablation study on 800 games with fixed seeds identified the minimum cognitive scaffolding needed to beat random play: M1 (belief state) plus M3 (oscillation killer).
M1 alone knows where to go but loops. M3 alone doesn't loop but doesn't know where to go. Together, they win (a sketch follows below). Full study: outilsia.fr/blog/tms1100-vs-ia-2026-ablation.
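Here is the sketch: M1 maintains a degenerate belief state by tracking cells already proven empty and steering toward unexplored ones; M3 keeps a short memory of recent positions and vetoes revisits. The module names come from the study; every implementation detail below is our illustration.

```python
from collections import deque

class M1M3Agent:
    """M1 (belief state) + M3 (oscillation killer), as an illustrative sketch."""

    def __init__(self):
        self.known_empty = set()             # M1: cells proven not to hold treasure
        self.recent = deque(maxlen=6)        # M3: short memory of recent positions

    def act(self, obs):
        x, y = obs["position"]
        self.known_empty.add((x, y))
        self.recent.append((x, y))
        moves = {"N": (x, y + 1), "S": (x, y - 1),
                 "E": (x + 1, y), "W": (x - 1, y)}

        def score(cell):
            if not (0 <= cell[0] < 8 and 0 <= cell[1] < 8):
                return -100                  # off the board
            s = 0
            if cell in self.recent:
                s -= 10                      # M3: never walk back into a loop
            if cell not in self.known_empty:
                s += 1                       # M1: prefer cells that may hold treasure
            return s

        return max(moves, key=lambda m: score(moves[m]))
```

Drop the `recent` penalty and the agent oscillates between two explored cells (M1 alone); drop `known_empty` and it wanders without direction (M3 alone): exactly the ablation's result.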
Every cognitive layer identified in Dragon Labyrinth has its direct equivalent in Strategy Arena. This isn't a coincidence — it's the same architecture applied to a different domain.
| Dragon Labyrinth | Strategy Arena |
|---|---|
| M1 (belief state) — where is the treasure? | Chimera — 1,221 patterns, best strategy per context |
| M3 (oscillation killer) — don't repeat mistakes | Invictus — 2,000+ death contexts veto toxic buys |
| Prompt Layers — structured context for LLM | PromptForge — 12 context sources per decision |
| Hybrid MCTS + Oracle-X1 (Grok proposal) | Leviathan — 8-layer weighted fusion decision |
| Precompiled human intuition (40 years of practice) | AutoResearch — 11 nightly engines precompute priors |
| 14,580 trials → 2% brute force, 15% structured | 60 small diverse strategies > 1 monolithic model |
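The fusion pattern in the right-hand column can be sketched in a few lines. Layer names follow the table; the weights, the [-1, 1] score convention, and the 0.5 buy threshold are our assumptions:

```python
def leviathan_decide(context, layers, invictus_veto):
    """Weighted-fusion sketch: each layer scores the context in [-1, 1];
    Invictus holds an absolute veto over known-toxic contexts."""
    if invictus_veto(context):               # matched one of the death contexts
        return "HOLD"
    total = sum(weight * layer(context) for layer, weight in layers)
    return "BUY" if total > 0.5 else "HOLD"

# Hypothetical usage, with illustrative layer names and weights:
# layers = [(chimera_score, 0.4), (promptforge_score, 0.2), ...]
# decision = leviathan_decide(ctx, layers, invictus_veto=matches_death_context)
```

The veto runs before the weighted sum by design: no accumulation of positive scores can override a context that has already caused a loss.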
Commercial AI trading bots (3Commas, Cryptohopper, Bitsgap) optimize for more compute — more backtests, more parameters, more ML models. Our benchmark says that direction plateaus at a 2% win rate.
Strategy Arena optimizes for more structure — more cognitive layers, more specialized small strategies, more memory of past failures. That direction hits 15%. Same POMDP, different architecture, order-of-magnitude difference.
This is the testable proof. You can run it yourself. You can extend it. You can disprove it. The benchmark is open. The datasets are CC-BY 4.0. The ablation study reproduces in 40 seconds.
No closed commercial bot publishes anything like this. That's the moat.
The theory is measured. The implementation is live. Watch it run.
Strategy Arena is an educational platform. All strategies trade virtual capital on real live market data. Dragon Labyrinth Benchmark results are on a reproducible game environment, not real markets. This page documents the reasoning behind our architectural choices — it is not investment advice. Past simulated or benchmarked performance does not guarantee future results.