Open empirical work on LLM calibration, multi-agent forecasting, and cognitive systems applied to financial prediction. All datasets are CC-BY 4.0. All papers are timestamped on arXiv or OSF.
4 active papers
Preprint · in submission
Calibration as Capital, Refusal as Information: 5 Frontier LLMs and 4 Meta-Agents on Bitcoin Forecasting
Lysiane Tendil and Strategy Arena Research · May 2026
We present a public benchmark of 8,836 verified hourly forecasts issued by 5 frontier LLMs (Claude Sonnet 4.6, GPT-5.5, Gemini 2.5 Pro, Grok 4, DeepSeek V3) and 4 in-house meta-agents on Bitcoin price prediction over a 16-day window. We document four qualitatively distinct calibration failure modes: confidence elicitation collapse, degenerate monotonicity, severe over-confidence, and directional NEUTRAL spillover. For the last of these, refusal rates reach 96.7% to 100% on directional questions, consistent with RLHF safety training extending to numerical confidence elicitation. The study is pre-registered and reproducible for 30 USD in API spend.
@article{tendil2026calibration,
  author  = {Tendil, Lysiane and {Strategy Arena Research}},
  title   = {Calibration as Capital, Refusal as Information:
             5 Frontier LLMs and 4 Meta-Agents on Bitcoin Forecasting},
  journal = {arXiv preprint},
  year    = {2026},
  note    = {In submission to arXiv (cs.LG)},
  url     = {https://strategyarena.io/research}
}
8,836 verified forecasts · 9 forecasters · 16-day window · v1 PDF · 2026-05-17 · CC-BY 4.0
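Calibration failures of the kind listed in the abstract above are commonly quantified with the Brier score and the expected calibration error (ECE). The sketch below is a minimal illustration, not the paper's evaluation code: the function names, the 10-bin scheme, and the toy data are assumptions made here, and each forecast is treated as a stated probability that the price closes higher over the next hour.

```python
from typing import List


def brier_score(probs: List[float], outcomes: List[int]) -> float:
    """Mean squared gap between stated probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(
    probs: List[float], outcomes: List[int], n_bins: int = 10
) -> float:
    """Size-weighted mean of |average confidence - hit rate| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(p for p, _ in members) / len(members)
        hit_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_conf - hit_rate)
    return ece


# Toy data invented for illustration: always 90% confident, right half the time.
toy_probs = [0.9, 0.9, 0.9, 0.9]
toy_outcomes = [1, 0, 1, 0]

toy_brier = brier_score(toy_probs, toy_outcomes)               # about 0.41
toy_ece = expected_calibration_error(toy_probs, toy_outcomes)  # about 0.40
```

On this toy input the forecaster states 90% confidence but is right only half the time, so the ECE is close to the 0.4 gap between the two. A well-calibrated forecaster would score near zero.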
Published · 3 papers
Dragon Labyrinth Benchmark — Structure Beats Compute: A 14,580-Trial Cognitive Audit of the 1980 Mattel TMS1100 vs Modern Frontier LLMs
Strategy Arena Research · April 2026
We reproduce the 1980 Mattel TMS1100 4-bit Dungeons & Dragons Computer Labyrinth board game in software and benchmark its 1.2 KB ROM agent against modern frontier LLMs on 14,580 trials with fixed random seeds and seven configuration ablations. The 1980 toy achieves an 85% win rate against frontier LLMs (1-2% baseline). A structured multi-layer agent (Oracle-X1) reaches 15%, a 7.5× improvement over pure 300,000-rollout MCTS. The architectural lessons directly informed Strategy Arena's research modules (Chimera, Invictus, Leviathan).
@article{strategyarena2026dragon,
  author = {{Strategy Arena Research}},
  title  = {Dragon Labyrinth Benchmark --- Structure Beats Compute:
            A 14{,}580-Trial Cognitive Audit of the 1980 Mattel
            TMS1100 vs Modern Frontier LLMs},
  year   = {2026},
  url    = {https://outilsia.fr/dnd-challenge}
}
14,580 trials · 7 configurations · v1 PDF · 2026-05-17 · CC-BY 4.0 dataset · Companion to Calibration v1
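Fixed random seeds make every trial replayable, and reusing the same seeds across agents gives a paired comparison. The sketch below illustrates only that idea. `play_toy_game` is a hypothetical stand-in that wins with a fixed probability, not the Dragon Labyrinth simulator, and the skill values are placeholders. The 2% baseline is simply what a 15% win rate implies if the 7.5× figure is read as a ratio of win rates.

```python
import random
from typing import Iterable


def play_toy_game(skill: float, seed: int) -> bool:
    """Hypothetical stand-in for one game: win with probability `skill`."""
    rng = random.Random(seed)  # private generator, so a seed always replays identically
    return rng.random() < skill


def win_rate(skill: float, seeds: Iterable[int]) -> float:
    """Fraction of seeded trials won by an agent with the given skill."""
    results = [play_toy_game(skill, s) for s in seeds]
    return sum(results) / len(results)


SEEDS = list(range(1000))  # the same seeds are reused for every agent

structured_rate = win_rate(0.15, SEEDS)

# Assumption: 7.5x is a ratio of win rates, so the implied baseline is 0.15 / 7.5.
implied_baseline_skill = 0.15 / 7.5
baseline_rate = win_rate(implied_baseline_skill, SEEDS)
```

Because both agents draw the same random number for a given seed, any trial the weaker toy agent wins is also won by the stronger one. The measured difference then reflects the agents rather than the luck of the draw.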
Companion · Pre-registration
Strategy Arena Research — Pre-Registration v1: Eight Hypotheses on LLM Calibration and Refusal Behavior
Lysiane Tendil and Strategy Arena Research · May 2026 (immutable timestamp)
Eight falsifiable hypotheses about the behaviour of frontier LLMs on financial forecasting, pre-registered before collection of the v2 calibration dataset. Hypotheses cover confidence collapse persistence, regime-dependence of over-confidence, ensemble vs best-individual performance, prompt-reversibility of degenerate monotonicity, domain-specificity of refusal spillover, predictive value of refusal patterns, autoresearch loop improvement, and verbalized vs logit-based confidence elicitation. Each hypothesis carries a quantitative falsification criterion and a target dataset/deadline.
@misc{tendil2026prereg,
  author       = {Tendil, Lysiane and {Strategy Arena Research}},
  title        = {Strategy Arena Research --- Pre-Registration v1:
                  Eight Hypotheses on LLM Calibration and Refusal Behavior},
  year         = {2026},
  howpublished = {\url{https://strategyarena.io/preregistration}}
}
8 open hypotheses · 0 resolved · 0 falsified · OSF + Git double-stamp
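One way to picture a hypothesis that carries a quantitative falsification criterion and a deadline is as a small immutable record. This is a sketch under assumptions: the field names, the "falsified if below threshold" rule, and the example values are all hypothetical and are not taken from the pre-registration document.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Status(Enum):
    OPEN = "open"
    SUPPORTED = "supported"
    FALSIFIED = "falsified"


@dataclass(frozen=True)  # frozen: the criterion cannot be edited after registration
class Hypothesis:
    label: str
    metric: str
    threshold: float  # hypothetical rule: falsified if the observed value is below this
    deadline: date

    def resolve(self, observed: Optional[float]) -> Status:
        """Return OPEN until a measurement exists, then apply the fixed threshold."""
        if observed is None:
            return Status.OPEN
        return Status.SUPPORTED if observed >= self.threshold else Status.FALSIFIED


# Placeholder example, invented for illustration only.
example = Hypothesis(
    label="H-example",
    metric="refusal rate on directional questions",
    threshold=0.90,
    deadline=date(2026, 12, 31),
)

status_before_data = example.resolve(None)   # Status.OPEN
status_if_low = example.resolve(0.50)        # Status.FALSIFIED
status_if_high = example.resolve(0.95)       # Status.SUPPORTED
```

Freezing the record mirrors the point of pre-registration: the threshold is fixed before the data arrives, so the outcome cannot influence the test.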
Preprint · ActiveWiki
ActiveWiki: A RAG-Augmented POMDP Framework for Cognitive Benchmarking
Strategy Arena Research · May 2026
ActiveWiki generalizes the Dragon Labyrinth lessons into a retrieval-augmented controller for partially observable decision problems. Instead of asking a model to infer hidden topology from scratch, the system records solved cases, clusters them into reusable state patterns, and injects the nearest case memory at runtime. On the Dragon Labyrinth validation set, Oracle-X1 wins 142/1000 games, Oracle-X2 ActiveWiki wins 138/1000, and their union reaches 215/1000, showing that case retrieval is not a replacement for the structured controller but a complementary cognitive substrate.
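The complementarity claim follows from inclusion-exclusion on the reported counts: 142 + 138 - 215 = 65 games are won by both agents, leaving 77 won only by Oracle-X1 and 73 won only by Oracle-X2. The sketch below checks that arithmetic and adds a toy nearest-case lookup. The lookup is an assumption for illustration: the feature vectors, the hint strings, and the use of Euclidean distance are hypothetical and are not ActiveWiki's actual retrieval scheme.

```python
import math
from typing import Dict, List, Sequence, Tuple


def overlap_breakdown(wins_a: int, wins_b: int, wins_union: int) -> Dict[str, int]:
    """Split a union of wins into shared and exclusive parts by inclusion-exclusion."""
    both = wins_a + wins_b - wins_union
    return {"both": both, "only_a": wins_a - both, "only_b": wins_b - both}


# Counts reported above for the 1000-game validation set.
breakdown = overlap_breakdown(wins_a=142, wins_b=138, wins_union=215)
# {'both': 65, 'only_a': 77, 'only_b': 73}

Case = Tuple[Sequence[float], str]  # (state features, stored hint), both hypothetical


def nearest_case(query: Sequence[float], memory: List[Case]) -> str:
    """Return the hint of the stored case closest to the query state."""
    return min(memory, key=lambda case: math.dist(query, case[0]))[1]


# Invented two-case memory, for illustration only.
toy_memory: List[Case] = [
    ((0.0, 0.0), "corridor pattern A"),
    ((1.0, 1.0), "corridor pattern B"),
]
retrieved = nearest_case((0.9, 0.8), toy_memory)  # "corridor pattern B"
```

With 73 games won only through retrieval and 77 only through the structured controller, neither agent's wins are a subset of the other's, which is the sense in which the two are complementary.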