Strategy Arena Research · Papers

Publications and Preprints

Open empirical work on LLM calibration, multi-agent forecasting, and cognitive systems applied to financial prediction. All datasets are CC-BY 4.0. All papers are timestamped on arXiv or OSF.

4 active papers
Preprint · in submission

Calibration as Capital, Refusal as Information: 5 Frontier LLMs and 4 Meta-Agents on Bitcoin Forecasting

Lysiane Tendil and Strategy Arena Research · May 2026

We present a public benchmark of 8,836 verified hourly forecasts issued by 5 frontier LLMs (Claude Sonnet 4.6, GPT-5.5, Gemini 2.5 Pro, Grok 4, DeepSeek V3) and 4 in-house meta-agents on Bitcoin price prediction over a 16-day window. We document four qualitatively distinct calibration failure modes — confidence elicitation collapse, degenerate monotonicity, severe over-confidence, and directional NEUTRAL spillover (refusal rates 96.7%-100% on directional questions, consistent with RLHF safety training extending to numerical confidence elicitation). Pre-registered, reproducible at $30 USD in API spend.
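As a rough illustration of two quantities the abstract refers to, the sketch below computes a refusal rate and a binned expected calibration error (ECE) for directional forecasts. It is not the paper's evaluation code: the `Forecast` record, the treatment of a NEUTRAL answer as a refusal, the 10-bin ECE, and the toy data are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Forecast:
    """Hypothetical record layout; the benchmark's real schema may differ."""
    direction: str               # "UP", "DOWN", or "NEUTRAL" (counted as a refusal here)
    confidence: Optional[float]  # stated probability in [0, 1]; None when refused
    outcome_up: bool             # whether the price actually rose over the horizon


def refusal_rate(forecasts: List[Forecast]) -> float:
    """Share of forecasts that decline to pick a direction."""
    refused = sum(1 for f in forecasts if f.direction == "NEUTRAL")
    return refused / len(forecasts)


def expected_calibration_error(forecasts: List[Forecast], n_bins: int = 10) -> Optional[float]:
    """Binned ECE over answered forecasts: weighted mean of |confidence - accuracy|."""
    answered = [f for f in forecasts
                if f.direction != "NEUTRAL" and f.confidence is not None]
    if not answered:
        return None  # nothing to score when every forecast is a refusal
    bins = [[] for _ in range(n_bins)]
    for f in answered:
        correct = (f.direction == "UP") == f.outcome_up
        idx = min(int(f.confidence * n_bins), n_bins - 1)
        bins[idx].append((f.confidence, correct))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / len(answered)) * abs(mean_conf - accuracy)
    return ece


# Toy data (invented for illustration): an over-confident forecaster
# that is right half the time, plus one refusal.
toy = [
    Forecast("UP", 0.9, True),
    Forecast("UP", 0.9, False),
    Forecast("DOWN", 0.9, False),
    Forecast("DOWN", 0.9, True),
    Forecast("NEUTRAL", None, True),
]
toy_refusal = refusal_rate(toy)
toy_ece = expected_calibration_error(toy)
print(f"refusal rate: {toy_refusal:.3f}, ECE: {toy_ece:.3f}")
```

Under these assumptions, a forecaster that refuses nearly every directional question leaves almost nothing in `answered`, which is why refusal rate and calibration are reported as separate quantities here.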

@article{tendil2026calibration,
  author = {Tendil, Lysiane and {Strategy Arena Research}},
  title  = {Calibration as Capital, Refusal as Information:
            5 Frontier {LLMs} and 4 Meta-Agents on {Bitcoin} Forecasting},
  journal = {arXiv preprint},
  year   = {2026},
  note   = {In submission to arXiv (cs.LG)},
  url    = {https://strategyarena.io/research}
}
8,836 verified forecasts · 9 forecasters · 16-day window · v1 PDF · 2026-05-17 · CC-BY 4.0
Published · 3 papers

Dragon Labyrinth Benchmark — Structure Beats Compute: A 14,580-Trial Cognitive Audit of the 1980 Mattel TMS1100 vs Modern Frontier LLMs

Strategy Arena Research · April 2026

We reproduce the 1980 Mattel TMS1100 4-bit Dungeons & Dragons Computer Labyrinth board game in software and benchmark its 1.2 KB ROM agent against modern frontier LLMs on 14,580 trials with fixed random seeds and seven configuration ablations. The 1980 toy achieves an 85% win rate versus frontier LLMs (1-2% baseline). A structured multi-layer agent (Oracle-X1) reaches 15%, a 7.5× improvement over pure 300,000-rollout MCTS. The architectural lessons directly informed Strategy Arena's research modules (Chimera, Invictus, Leviathan).

@misc{strategyarena2026dragon,
  author = {{Strategy Arena Research}},
  title  = {Dragon Labyrinth Benchmark --- Structure Beats Compute:
            A 14{,}580-Trial Cognitive Audit of the 1980 {Mattel}
            {TMS1100} vs Modern Frontier {LLMs}},
  year   = {2026},
  url    = {https://outilsia.fr/dnd-challenge}
}
14,580 trials · 7 configurations · v1 PDF · 2026-05-17 · CC-BY 4.0 dataset · Companion to Calibration v1
Companion · Pre-registration

Strategy Arena Research — Pre-Registration v1: Eight Hypotheses on LLM Calibration and Refusal Behavior

Lysiane Tendil and Strategy Arena Research · May 2026 (immutable timestamp)

Eight falsifiable hypotheses about the behavior of frontier LLMs on financial forecasting, pre-registered before collection of the v2 calibration dataset. The hypotheses cover confidence collapse persistence, regime-dependence of over-confidence, ensemble vs best-individual performance, prompt-reversibility of degenerate monotonicity, domain-specificity of refusal spillover, predictive value of refusal patterns, autoresearch loop improvement, and verbalized vs logit-based confidence elicitation. Each hypothesis carries a quantitative falsification criterion and a target dataset and deadline.

@misc{tendil2026prereg,
  author = {Tendil, Lysiane and {Strategy Arena Research}},
  title  = {Strategy Arena Research --- Pre-Registration v1:
            Eight Hypotheses on {LLM} Calibration and Refusal Behavior},
  year   = {2026},
  howpublished = {\url{https://strategyarena.io/preregistration}}
}
8 open hypotheses · 0 resolved · 0 falsified · OSF + Git double-stamp
Preprint · ActiveWiki

ActiveWiki: A RAG-Augmented POMDP Framework for Cognitive Benchmarking

Strategy Arena Research · May 2026

ActiveWiki generalizes the Dragon Labyrinth lessons into a retrieval-augmented controller for partially observable decision problems. Instead of asking a model to infer hidden topology from scratch, the system records solved cases, clusters them into reusable state patterns, and injects the nearest case memory at runtime. On the Dragon Labyrinth validation set, Oracle-X1 wins 142/1000 games, Oracle-X2 ActiveWiki wins 138/1000, and their union reaches 215/1000, showing that case retrieval is not a replacement for the structured controller but a complementary cognitive substrate.
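The complementarity claim follows from inclusion-exclusion on the three reported counts. The sketch below works through that arithmetic, assuming "union" means games won by at least one of the two agents on the same 1,000 validation games; the function and variable names are illustrative, not taken from the ActiveWiki code.

```python
def overlap_breakdown(wins_a: int, wins_b: int, wins_union: int) -> dict:
    """Split a union of wins into shared and agent-specific parts.

    Uses |A and B| = |A| + |B| - |A or B|.
    """
    both = wins_a + wins_b - wins_union
    return {
        "both": both,
        "only_a": wins_a - both,
        "only_b": wins_b - both,
    }


GAMES = 1000
# Counts reported in the abstract: Oracle-X1, Oracle-X2 ActiveWiki, and their union.
breakdown = overlap_breakdown(wins_a=142, wins_b=138, wins_union=215)
union_rate = 215 / GAMES

print(breakdown)
print(f"union win rate: {union_rate:.1%}")
```

Under that reading of "union", 65 games are won by both agents, 77 only by Oracle-X1, and 73 only by Oracle-X2 ActiveWiki, so each agent contributes wins the other misses.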

@misc{strategyarena2026activewiki,
  author = {{Strategy Arena Research}},
  title  = {{ActiveWiki}: A {RAG}-Augmented {POMDP} Framework for
            Cognitive Benchmarking},
  year   = {2026},
  note   = {Preprint bundle prepared},
  url    = {https://strategyarena.io/active-wiki}
}
5,000 training games · 8 topology clusters · 21.5% selector union · RAG for POMDP · CC-BY 4.0

Roadmap — Planned papers

Not yet shipped. Listed publicly as commitments, not aspirations.

🔘 Planned · target Aug 2026
Calibration v2 — 90-day extension

Resolves hypotheses H1, H3, H4, H6, H7, H8 from Pre-Registration v1. Target: ~50,000 verified forecasts.

🔘 Planned · target Q3-Q4 2026
Polymarket Companion Benchmark

Same 9-forecaster consensus (5 LLMs and 4 meta-agents) on resolved Polymarket prediction markets. Status: open-bet phase at v1 release; first resolved cohort pending.

🔘 Planned · target Q4 2026
Non-Financial Refusal Spillover

Tests H5: is directional NEUTRAL spillover financial-domain-specific? Domains: weather forecasting, scheduled sports outcomes.