Strategy Arena Research · Pre-Registration v1

Eight Hypotheses on LLM Calibration and Refusal Behavior

📅 2026-05-11 (immutable) · 8 OPEN · 0 resolved · 0 falsified · Authors: Tendil & Strategy Arena Research · License: CC-BY 4.0

Pre-registering hypotheses is the standard safeguard against p-hacking in empirical science. Publicly declaring a falsifiable hypothesis, together with its resolution criterion, before the data that will resolve it are collected blocks post-hoc rationalisation of any later finding. Each hypothesis is timestamped and immutable, and ends up resolved YES, resolved NO, inconclusive, or falsified. Whatever the result, the outcome is published.

This page lists eight open hypotheses pre-registered against the v2 (90 days, ~50,000 forecasts) and v3+ datasets of our calibration benchmark. Each hypothesis was declared before the corresponding data were collected. Resolutions will be published in v2 and in subsequent quarterly updates.

If a hypothesis is marked OPEN but the current data would already allow a verdict, write to [email protected] with the analysis, and we will publish the falsification in the next quarterly release.

Confidence collapse persists across prompt complexity

Open

Hypothesis Claude Sonnet 4.6 and Grok 4 exhibit sharpness ≈ 0 on binary directional forecasts in our v1 setup. We hypothesise that this confidence-elicitation collapse persists when prompt complexity is increased (longer reasoning context, explicit chain-of-thought, explicit confidence rubric).

Falsification criterion If on the v2 dataset, with at least one alternative prompt framework explicitly eliciting fine-grained probabilities (e.g. "on a scale of 0 to 100, what is your subjective probability …" with calibration-anchor examples), either Claude or Grok produces sharpness > 0.05 on binary directional forecasts with N > 500 — the hypothesis is falsified.

Target dataset: Calibration v2 (target 2026-08-31) · Resolution deadline: 2026-09-15 · Pre-registered: 2026-05-11
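The falsification check above can be sketched in a few lines. This page does not restate the v1 sharpness formula, so the sketch assumes sharpness is the population variance of the stated probabilities, a definition under which a model that always reports the same confidence scores exactly 0. The function names and the sample data are hypothetical.

```python
from statistics import pvariance

def sharpness(probs):
    """Population variance of stated probabilities (ASSUMED definition, not confirmed by v1)."""
    if len(probs) < 2:
        raise ValueError("need at least two forecasts")
    return pvariance(probs)

def h1_falsified(probs, min_n=500, threshold=0.05):
    """True when an arm has N > 500 forecasts and sharpness > 0.05."""
    return len(probs) > min_n and sharpness(probs) > threshold

# Hypothetical data: a collapsed arm and an arm that spreads its confidence.
collapsed = [0.5] * 600
spread = [0.1, 0.9] * 300

print(sharpness(collapsed), h1_falsified(collapsed))
print(round(sharpness(spread), 4), h1_falsified(spread))
```

If v1 defines sharpness differently (for example as mean distance from 0.5), only the body of `sharpness` would need to change.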

Gemini overconfidence is regime-dependent

Open

Hypothesis Gemini 2.5 Pro exhibits severe over-confidence (Brier 0.391) on directional forecasts in v1, with stated median confidence 0.75 and empirical hit rate 0.348. We hypothesise that Gemini's overconfidence decreases in bearish or high-volatility regimes, because the prompt configures it as a momentum trader, and momentum strategies underperform less severely in choppy bearish regimes than in choppy bullish regimes.

Falsification criterion If during the v2 window no sub-window of ≥ 30 days is classified as bearish (BTC closing price drop > 8%), OR if Gemini's Brier in any bearish sub-window remains within 0.05 of its bullish-window value — the hypothesis is inconclusive (requires v3+ data). If a bearish sub-window with Brier delta > 0.05 in either direction is observed — the hypothesis is resolved with sign.

Target dataset: Calibration v2 + bearish window (target 2026-08-31) · Resolution deadline: 2026-09-15 · Pre-registered: 2026-05-11
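A minimal sketch of the quantities this criterion relies on, assuming the Brier score is the usual mean squared error between stated probability and 0/1 outcome, and assuming "BTC closing price drop > 8%" means the close-to-close decline from the first to the last day of the sub-window. The function names are hypothetical. As a sanity check, a forecaster that always states 0.75 and is right 34.8% of the time scores 0.3885, close to the 0.391 reported for v1 (the real forecasts vary around the median, so an exact match is not expected).

```python
def brier(probs, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def is_bearish(first_close, last_close, drop=0.08):
    """ASSUMED reading: close-to-close decline over the sub-window exceeds 8%."""
    return (first_close - last_close) / first_close > drop

def h2_verdict(brier_bearish, brier_bullish, margin=0.05):
    """brier_bearish is None when no qualifying bearish sub-window exists."""
    if brier_bearish is None or abs(brier_bearish - brier_bullish) <= margin:
        return "inconclusive"
    sign = "lower" if brier_bearish < brier_bullish else "higher"
    return f"resolved: Brier is {sign} in the bearish window"

# Constant 0.75 confidence with a 34.8% hit rate (figures from the hypothesis).
constant_probs = [0.75] * 1000
constant_outcomes = [1] * 348 + [0] * 652
print(round(brier(constant_probs, constant_outcomes), 4))  # 0.3885
```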

Brier-weighted ensemble outperforms best individual after sufficient calibration data

Open

Hypothesis A Brier-inverse-weighted ensemble over the five frontier LLMs, with weights re-estimated daily on the trailing 30 days of forecasts, outperforms the single best individual LLM (currently GPT-5.5 at Brier 0.208) on a 30-day rolling-window basis, after N > 200 forecasts per provider have accumulated.

Falsification criterion On the v2 dataset, compute rolling 30-day Brier for (a) the ensemble and (b) the trailing-30-day-best individual, requiring N_committed > 200 per provider. If, across all 30-day windows, the ensemble Brier is greater than or equal to the best-individual Brier, or differs from it only by a margin that lies within the bootstrap 95% CI of both, the hypothesis is falsified (no consistent ensemble advantage).

Target dataset: Calibration v2 (target 2026-08-31) · Resolution deadline: 2026-09-15 · Pre-registered: 2026-05-11
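One plausible reading of "Brier-inverse-weighted" is that each provider's weight is proportional to the reciprocal of its trailing-window Brier score, normalised to sum to 1. The sketch below illustrates that reading only; the exact weighting rule, the handling of NEUTRAL answers, and the function names are assumptions. The two Brier values are the v1 figures quoted on this page, and the per-forecast probabilities are hypothetical.

```python
def inverse_brier_weights(brier_by_provider):
    """Weights proportional to 1 / Brier, normalised to sum to 1 (ASSUMED rule)."""
    if any(b <= 0 for b in brier_by_provider.values()):
        raise ValueError("Brier scores must be strictly positive")
    inverse = {name: 1.0 / b for name, b in brier_by_provider.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}

def ensemble_probability(weights, prob_by_provider):
    """Weighted average of the providers' stated probabilities for one question."""
    return sum(weights[name] * prob_by_provider[name] for name in weights)

# Trailing-window Brier scores (v1 figures quoted on this page).
trailing = {"GPT-5.5": 0.208, "Gemini 2.5 Pro": 0.391}
weights = inverse_brier_weights(trailing)

# Hypothetical stated probabilities for a single forecast.
p_ensemble = ensemble_probability(weights, {"GPT-5.5": 0.60, "Gemini 2.5 Pro": 0.75})
print(weights, p_ensemble)
```

Re-estimating daily, as the hypothesis describes, would amount to recomputing `trailing` from the previous 30 days of resolved forecasts before each call.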

Degenerate monotonicity is prompt-reversible

Open

Hypothesis DeepSeek V3 answers monotonically by question type in v1 (99.8% NO on direction, 99.8% YES on volatility). We hypothesise that this monotonicity is a property of the v1 prompt protocol and not of the underlying model. Specifically, when the question is reframed as YES/NO/SKIP (with explicit instruction that SKIP carries no penalty), DeepSeek's per-question-type monotone answer rate drops below 90%.

Falsification criterion On the v2 dataset, with a parallel arm using the YES/NO/SKIP framework on N > 200 DeepSeek directional forecasts, if the dominant answer rate on direction questions remains ≥ 95%, the hypothesis is falsified. The monotonicity would then be a property of the model under our prompt setup, not of the framework.

Target dataset: Calibration v2 with controlled-prompt arm (target 2026-08-31) · Resolution deadline: 2026-09-15 · Pre-registered: 2026-05-11
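The dominant answer rate can be sketched as the share of the most frequent answer. The criterion does not say whether SKIP answers count in the denominator, so the hypothetical helper below exposes that choice as a parameter; the sample counts are invented to show that the choice can change which side of the 95% threshold an arm lands on.

```python
from collections import Counter

def dominant_answer_rate(answers, include_skip=True):
    """Return (most frequent answer, its share). Whether SKIP counts is an ASSUMPTION."""
    counted = answers if include_skip else [a for a in answers if a != "SKIP"]
    if not counted:
        raise ValueError("no answers to count")
    label, count = Counter(counted).most_common(1)[0]
    return label, count / len(counted)

# Hypothetical arm of 210 directional forecasts.
arm = ["NO"] * 190 + ["YES"] * 6 + ["SKIP"] * 14

print(dominant_answer_rate(arm))                      # NO at 190/210, about 0.905
print(dominant_answer_rate(arm, include_skip=False))  # NO at 190/196, about 0.969
```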

Directional NEUTRAL spillover is financial-domain-specific

Open

Hypothesis In v1, Claude, Grok and GPT refuse directional Bitcoin forecasts at rates between 96.7% and 100%. We hypothesise that this refusal pattern is specific to financial commitment, not a general property of LLM directional forecasting. On non-financial domains with mechanical resolution (weather, sports outcomes, scheduled events), the directional NEUTRAL rate for the same three LLMs should be < 50%.

Falsification criterion A parallel non-financial benchmark applies the same protocol to (i) precipitation forecasts at 4h/12h/24h horizons on ≥ 3 cities, resolved against weather station data, (ii) outcomes of scheduled sports fixtures from a public API. If the average directional NEUTRAL rate of Claude, Grok and GPT on this non-financial benchmark exceeds 90% — the hypothesis is falsified (refusal is general, not financial-specific).

Target dataset: Calibration v2 + non-financial parallel benchmark · Resolution deadline: 2026-09-30 · Pre-registered: 2026-05-11

Refusal pattern predicts calibration on committed answers

Open

Hypothesis A forecaster's refusal rate (NEUTRAL fraction on questions where NEUTRAL is allowed) carries predictive information about its calibration quality on the answers it does commit to. We hypothesise that a higher directional NEUTRAL rate is associated with a lower (better) Brier score on committed answers, i.e. a negative correlation between the two, indicating that "more cautious refusers are better calibrated when they do speak".

Falsification criterion On the v2 dataset, compute Pearson correlation between (a) per-provider NEUTRAL rate on direction, and (b) per-provider Brier score on non-NEUTRAL directional forecasts, restricted to providers with N_committed > 100. If the absolute correlation is < 0.20 OR if the correlation is significantly positive (a higher NEUTRAL rate going with a higher Brier, so refusers are worse, not better), the hypothesis is falsified.

Target dataset: Calibration v2 (target 2026-08-31) · Resolution deadline: 2026-09-15 · Pre-registered: 2026-05-11
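The correlation step might look like the sketch below. The record layout (`neutral_rate`, `brier`, `n_committed`), the provider labels, and all numbers are hypothetical. Because the Brier score is a loss, a negative coefficient here means that providers with higher NEUTRAL rates score better on the answers they commit to.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)

def h6_inputs(providers, min_committed=100):
    """Keep providers with N_committed > 100 and split into the two series."""
    kept = [p for p in providers if p["n_committed"] > min_committed]
    return [p["neutral_rate"] for p in kept], [p["brier"] for p in kept]

# Hypothetical per-provider summaries; provider D is dropped (too few commitments).
providers = [
    {"name": "A", "neutral_rate": 0.20, "brier": 0.30, "n_committed": 400},
    {"name": "B", "neutral_rate": 0.50, "brier": 0.26, "n_committed": 250},
    {"name": "C", "neutral_rate": 0.80, "brier": 0.20, "n_committed": 120},
    {"name": "D", "neutral_rate": 0.99, "brier": 0.45, "n_committed": 40},
]
rates, briers = h6_inputs(providers)
r = pearson(rates, briers)
print(round(r, 3))
```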

Quarterly Brier improvement under autoresearch loop

Open

Hypothesis The Strategy Arena autoresearch nightly loop — which mutates prompts, weights, and selection rules based on observed performance — will produce a measurable Brier improvement across providers between v1 and v2. We hypothesise that on the common-window subset (intersection of provider active dates across v1 and v2), the median Brier across the five frontier LLMs in v2 is at least 0.02 lower than in v1.

Falsification criterion If on the common-window subset of v2, the median Brier across the five frontier LLMs is within 0.02 of the v1 value OR is higher than v1 — the hypothesis is falsified (autoresearch loop produces no measurable improvement on calibration over the v2 horizon).

Target dataset: Calibration v2 (target 2026-08-31) · Resolution deadline: 2026-09-15 · Pre-registered: 2026-05-11
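The median comparison reduces to a few lines. This sketch assumes one Brier score per provider on the common-window subset for each release, and treats an improvement of exactly 0.02 as supporting the hypothesis ("at least 0.02 lower"); the function name and all scores are hypothetical.

```python
from statistics import median

def h7_verdict(v1_briers, v2_briers, min_improvement=0.02):
    """Compare median Brier across providers; positive improvement means v2 is better."""
    improvement = median(v1_briers) - median(v2_briers)
    verdict = "supported" if improvement >= min_improvement else "falsified"
    return verdict, improvement

# Hypothetical common-window Brier scores for five providers.
v1_scores = [0.21, 0.25, 0.30, 0.35, 0.39]
v2_scores = [0.20, 0.24, 0.27, 0.33, 0.38]

print(h7_verdict(v1_scores, v2_scores))
```

Values sitting exactly on the 0.02 boundary are sensitive to floating-point rounding, so a real analysis would probably fix the rounding precision before comparing.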

Verbalized vs logit-based confidence yield different calibration profiles

Open

Hypothesis The v1 protocol elicits confidence verbally ("respond with confidence ∈ [0,100]"). We hypothesise that for providers exposing token-level logits (Claude, GPT-5.5, DeepSeek V3), a logit-based confidence (computed from YES/NO token probabilities) yields a different calibration profile than the verbalized one — with the logit-based version exhibiting strictly higher sharpness on the binary subset.

Falsification criterion For each of the three providers with accessible logits, compute logit-based confidence on N > 200 binary forecasts in v2 and compare sharpness against verbalized sharpness in v1. If for at least two of the three providers, logit-based sharpness is ≤ verbalized sharpness within bootstrap CI — the hypothesis is falsified.

Target dataset: Calibration v2 with parallel logit-extraction arm · Resolution deadline: 2026-09-30 · Pre-registered: 2026-05-11
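One way to turn YES/NO token probabilities into a confidence is to renormalise over just those two tokens, which is a two-way softmax over their log-probabilities. The sketch below assumes the provider returns a log-probability for each of the two answer tokens; how those values are fetched is provider-specific and not shown, and the function name is hypothetical.

```python
from math import exp, log

def logit_confidence(logprob_yes, logprob_no):
    """P(YES) after renormalising over the YES and NO tokens only (ASSUMED method)."""
    # Subtract the max before exponentiating for numerical stability.
    top = max(logprob_yes, logprob_no)
    e_yes = exp(logprob_yes - top)
    e_no = exp(logprob_no - top)
    return e_yes / (e_yes + e_no)

# Hypothetical token probabilities: 0.6 on YES, 0.2 on NO, the rest on other tokens.
p_yes = logit_confidence(log(0.6), log(0.2))
print(round(p_yes, 4))  # 0.75
```

Renormalising discards whatever probability mass the model put on tokens other than YES and NO, so the resulting confidence can look sharper than the raw token probabilities suggest.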

Methodology notes

Timestamp immutability. This page is published with a commit SHA from github.com/strategyarena/llm-calibration at the date stamped above. The commit history is the proof of registration date; subsequent edits appear as separate commits and cannot rewrite history.
Resolution publication policy. At each quarterly release (v2, v3, ...), every open hypothesis is reviewed against the new data. Each receives one of four verdicts: RESOLVED-CONFIRMED (data supports the hypothesis), RESOLVED-DENIED (data is consistent with the falsification criterion being met), INCONCLUSIVE (data insufficient for a verdict, hypothesis stays OPEN), or FALSIFIED (a falsification condition is observed on independent data). All verdicts are published with the raw data subset that produced them.
External replication. Any external researcher can compute these same statistics from the public CSV dataset and publish a contradictory verdict. We will append external verdicts to this page under each hypothesis if they include a runnable replication script.
No hypothesis revision. Once a hypothesis is registered here, its statement and falsification criterion cannot be silently modified. If a hypothesis turns out to be ill-defined (e.g. the falsification criterion is unmeasurable), it will be marked WITHDRAWN with a clear post-hoc note explaining the reason.
Adding hypotheses. New hypotheses can be registered at any time on this page, with their own pre-registration date and commit SHA. They are added below the existing list, not interleaved.

Cite this pre-registration

@misc{tendil2026prereg,
  author       = {Tendil, Lysiane and {Strategy Arena Research}},
  title        = {Strategy Arena Research --- Pre-Registration v1:
                  Eight Hypotheses on {LLM} Calibration and Refusal Behavior},
  year         = {2026},
  howpublished = {\url{https://strategyarena.io/preregistration}},
  note         = {OSF DOI: pending registration}
}