A Software 3.0 Cognitive System Applied to Trading
1. Motivation: Software 3.0 needs a testbed
In State of GPT (2023) and subsequent talks, Andrej Karpathy proposed that large language models constitute a third programming paradigm — alongside Software 1.0 (hand-written code) and Software 2.0 (learned weights). In Software 3.0, programs are written in natural language, executed by LLMs, and composed into agentic loops that accumulate, think, act, and learn [1]. Trading is an unusually clean testbed for this idea, because outcomes resolve in hours, capital is virtual (paper trading carries zero counterparty risk), and the ground truth — did the price go up or not — admits no interpretation.
We built Strategy Arena over four months as a solo bootstrapped project to test how far Software 3.0 ideas can be pushed in a domain that punishes mistakes. The result is not a trading bot. It is a cognitive system with its own domain-specific language, its own institutional memory, its own immune system against repeating failures, and an audit surface that exposes every internal state through public APIs.
This document is the first written description of what was built. The site has been positioned publicly as a cryptocurrency AI trading platform, but that framing under-describes the artifact. The intended audience here is researchers, engineers, and observers who work on multi-LLM systems, calibration, and agentic loops, not the retail crypto trader.
2. Architecture overview
The system has four levels stacked above the trading arena.
Decisions flow upward (arena strategies emit signals, meta layers aggregate). Memory flows downward (AutoResearch and ActiveWiki rewrite the prompts, parameters, and selection weights of the layers below). Invictus operates as a veto in parallel: any buy signal that matches a previously clustered failure context is blocked, regardless of how confident the upper layers were.
3. The six strategy architectures
Most of the 60+ strategies in the arena are conventional: buy-and-hold, DCA, RSI mean-reversion, Donchian channels, etc. Six are unusual enough to warrant individual description, because each implements a distinct scientific thesis.
3.1 Chimera — micro-models retrieval
Chimera maintains a bank of 1,221 distinct market patterns extracted from historical data, each indexed semantically against the performance of the 1,196 strategies that previously matched it. At inference time, the current price action is converted into a feature vector and compared by cosine similarity against the entire bank.
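The retrieval step reduces to a nearest-pattern lookup. The sketch below is illustrative only: the (1221, d) pattern matrix, the parallel list of best-strategy ids, and the feature extraction are stand-ins for the real indexing pipeline in backend/chimera/.

```python
import numpy as np

def best_match(current_features, pattern_bank, best_strategy_ids):
    """Cosine similarity of the current feature vector against all 1,221 patterns.

    pattern_bank: (1221, d) array, one row per indexed market pattern.
    best_strategy_ids: the strategy that historically performed best on each pattern.
    """
    x = current_features / np.linalg.norm(current_features)
    bank = pattern_bank / np.linalg.norm(pattern_bank, axis=1, keepdims=True)
    sims = bank @ x                        # cosine similarity against the entire bank
    idx = int(np.argmax(sims))
    return best_strategy_ids[idx], float(sims[idx])

# strategy_id, score = best_match(features_now, bank, winners)
# -> the "chameleon" signal promoted for the current tick
```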
The matched pattern's historical winning strategy is then promoted as the "chameleon" signal for the current tick. Chimera contains 51 million indexed pattern occurrences across the historical archive, occupying ~59 MB. This is consistent with Karpathy's micro-models advocacy: rather than one large model attempting to predict, 1,196 small specialised models are retrieved by similarity. See backend/chimera/ in the source for the indexing pipeline.
3.2 Leviathan — non-linear fusion
Leviathan ingests Chimera's pattern signal, Hydra's meta-learner output, the votes of six LLMs, news sentiment, and a contrarian filter, then produces a single BUY/SELL/HOLD decision via a learned non-linear function of these inputs.
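Leviathan's actual fusion formula is not reproduced here. The sketch below only illustrates the shape of the idea, a non-linear combination of the named inputs, with made-up weights and an arbitrary interaction term:

```python
import numpy as np

def fuse(chimera, hydra, llm_votes, sentiment, contrarian,
         weights=(0.30, 0.25, 0.20, 0.15, 0.10)):
    """All inputs normalised to [-1, +1]; llm_votes is a list of six LLM votes."""
    x = np.array([chimera, hydra, np.mean(llm_votes), sentiment, contrarian])
    score = np.tanh(np.dot(weights, x) + 0.5 * chimera * hydra)  # toy non-linearity
    if score > 0.2:
        return "BUY"
    if score < -0.2:
        return "SELL"
    return "HOLD"

# fuse(0.6, 0.4, [1, 1, -1, 1, 0, 1], 0.2, -0.1)  -> "BUY"
```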
Leviathan was validated indirectly on the Dragon Labyrinth Benchmark [2], a published reproduction of the 1980 Mattel D&D Computer Labyrinth board game played by both modern AIs and the original 4-bit TMS1100 processor. In a 14,580-trial grid search, a pure brute-force MCTS plateaued at 2% win rate; a structured multi-layer agent (Oracle-X1) reached 15%, a 7.5× outperformance. Leviathan applies the same lesson: structure beats compute.
3.3 Hydra — XGBoost meta-learner
Hydra trains an XGBoost classifier on the historical trade outcomes of every strategy in the arena. At each tick, it asks: "if each strategy were to trade now, would it be profitable?" The 16 input features include RSI, ATR%, EMA distances and slopes, multi-window rate-of-change, Bollinger width, volatility, and consecutive direction counters. Output is aggregated to a consensus BUY/SELL signal when at least 45% of strategies are predicted profitable above 50% confidence. Hydra operates in shadow mode (trades the arena, hidden from public APIs) until its accuracy is independently validated.
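A hedged sketch of the meta-learner follows, assuming a flat table of per-(strategy, tick) feature rows labelled by whether that strategy's trade would have been profitable. The feature list and model settings are illustrative; only the 45% / 50% thresholds come from the text.

```python
import numpy as np
from xgboost import XGBClassifier

def train_meta_learner(X, y):
    """X: per-(strategy, tick) feature rows (RSI, ATR%, EMA distances, ...);
    y: 1 if that strategy's trade at that tick was profitable, else 0."""
    model = XGBClassifier(n_estimators=200, max_depth=4)
    model.fit(X, y)
    return model

def consensus_signal(model, rows_now, min_frac=0.45, min_conf=0.50):
    """BUY when at least 45% of strategies are predicted profitable above 50% confidence."""
    p_profitable = model.predict_proba(np.asarray(rows_now))[:, 1]
    frac = np.mean(p_profitable > min_conf)
    return "BUY" if frac >= min_frac else "HOLD"
```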
3.4 QuantumCollapse — quantum-inspired circuit
QuantumCollapse models the trading decision as a 4-qubit quantum circuit, where each qubit represents a market dimension: momentum (qubit 1), volume (qubit 2), regime (qubit 3), structure (qubit 4). Hadamard gates create superposition over each dimension, CNOT gates entangle them, and the probability of a LONG outcome is extracted deterministically by reading the squared amplitudes |ψ|² rather than by random sampling. This makes the strategy fully reproducible. The implementation is in numpy (no quantum hardware involved) and follows standard quantum-inspired computing conventions [3]. The concept originated from xAI's Grok; the deterministic-measurement port and arena adaptation are credited to Anthropic's Claude in backend/strategies/vault/current/quantum_collapse_strategy.py.
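For concreteness, here is a self-contained numpy sketch of such a circuit. It is not the production code in quantum_collapse_strategy.py: the RY feature encoding, the CNOT chain topology, and the majority-of-qubits readout are our assumptions; only the Hadamard / CNOT / deterministic-|ψ|² structure comes from the description above.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)   # Hadamard
I2 = np.eye(2)

def ry(theta):
    """RY rotation used here to encode a market feature as an amplitude."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def embed(gate, qubit, n=4):
    """Lift a single-qubit gate acting on `qubit` into the n-qubit space."""
    mats = [gate if i == qubit else I2 for i in range(n)]
    out = mats[0]
    for m in mats[1:]:
        out = np.kron(out, m)
    return out

def cnot(control, target, n=4):
    """Permutation matrix of a CNOT between two qubits of an n-qubit register."""
    dim = 2 ** n
    U = np.zeros((dim, dim))
    for basis in range(dim):
        bits = [(basis >> (n - 1 - i)) & 1 for i in range(n)]
        if bits[control]:
            bits[target] ^= 1
        U[sum(b << (n - 1 - i) for i, b in enumerate(bits)), basis] = 1.0
    return U

def p_long(features):
    """features in [0, 1]: momentum, volume, regime, structure."""
    n = 4
    psi = np.zeros(2 ** n)
    psi[0] = 1.0                                         # |0000>
    for q in range(n):                                   # superposition per dimension
        psi = embed(H, q, n) @ psi
    for q, f in enumerate(features):                     # feature-dependent rotations
        psi = embed(ry(np.pi * f), q, n) @ psi
    for q in range(n - 1):                               # entangle adjacent dimensions
        psi = cnot(q, q + 1, n) @ psi
    probs = np.abs(psi) ** 2                             # deterministic |psi|^2 readout
    return sum(p for b, p in enumerate(probs) if bin(b).count("1") >= 3)

print(round(p_long([0.8, 0.6, 0.7, 0.5]), 3))            # P(LONG), fully reproducible
```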
3.5 MomentumDiffusion — physics-inspired PDE
MomentumDiffusion treats momentum across timeframes as a one-dimensional diffusion field and solves the heat equation

∂u/∂t = D ∂²u/∂x²
where u(x, t) is momentum at log-spaced timeframe x ∈ {15min, 1h, 4h, 12h, 1D} at time t, and D is an adaptive diffusion coefficient based on realised volatility. The strategy enters when the diffusion front accelerates toward longer timeframes (a momentum "wave" forming) and volume confirms. Concept by Grok; numerical PDE implementation by Claude.
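A minimal numerical sketch follows, assuming an explicit finite-difference step over the five timeframe nodes; the grid spacing, time step, and entry heuristic are assumptions, not the production solver.

```python
import numpy as np

TIMEFRAMES = ["15min", "1h", "4h", "12h", "1D"]   # log-spaced positions x

def diffuse(u, D, dt=0.1, dx=1.0, steps=10):
    """Explicit Euler step of u_t = D * u_xx (stable while D*dt/dx**2 <= 0.5).

    u: momentum measured on each timeframe; D: adaptive diffusion coefficient
    derived from realised volatility."""
    u = np.asarray(u, dtype=float).copy()
    for _ in range(steps):
        lap = np.zeros_like(u)
        lap[1:-1] = u[2:] - 2 * u[1:-1] + u[:-2]   # discrete second derivative
        u += D * dt / dx ** 2 * lap
    return u

# Entry heuristic (assumption): a momentum "wave" is forming when the diffused
# field grows fastest at the long-timeframe end and volume confirms.
print(diffuse([1.2, 0.8, 0.3, 0.1, 0.0], D=0.4))
```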
3.6 DebateForge — genetic multi-agent debate
DebateForge runs five internal agents that debate every tick: Whale Hunter (volume + CVD), Vol Breakout (true ATR), Techno Predator (adaptive bands), Mean Reversion Monk (z-score multi-window), and Regime Oracle (ADX + volume regime). Agents vote weighted by their recent performance. Each night, the worst-performing agent's parameters are crossed genetically with the best-performing agent's, and mutated stochastically. The strategy is the first in the ecosystem co-designed by three LLMs (Grok + DeepSeek + Claude). See backend/strategies/vault/current/debate_forge_strategy.py.
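A hedged sketch of the two mechanisms named above, performance-weighted voting and nightly crossover of the worst agent with the best; the agent parameter layout and mutation scale are assumptions, not the production code.

```python
import random

def weighted_vote(votes, perf):
    """votes: {agent: +1 (buy) / -1 (sell) / 0}; perf: recent performance weights."""
    total = sum(max(perf[a], 0.0) * v for a, v in votes.items())
    return "BUY" if total > 0 else "SELL" if total < 0 else "HOLD"

def nightly_evolution(params, perf, mutation=0.10):
    """Cross the worst agent's parameters with the best agent's, then mutate."""
    best, worst = max(perf, key=perf.get), min(perf, key=perf.get)
    child = {}
    for key in params[worst]:
        parent = params[best] if random.random() < 0.5 else params[worst]
        child[key] = parent[key] * (1 + random.uniform(-mutation, mutation))
    params[worst] = child
    return params
```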
4. Invictus: a supervised failure-mode classifier
Risk-aware reinforcement learning literature has long argued that avoiding catastrophic outcomes is qualitatively different from maximising expected return [4]. Invictus operationalises this idea outside the RL framework. It collects "death contexts" — fingerprints of every losing trade across all strategies, encoding regime, indicator state, and recent price action — and clusters them. At inference, any incoming buy signal whose context matches a known failure cluster is vetoed, regardless of which strategy produced it.
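A minimal sketch of the veto, assuming death contexts are fixed-length numeric fingerprints; the clustering method (KMeans here) and the distance threshold are assumptions standing in for the production classifier.

```python
import numpy as np
from sklearn.cluster import KMeans

class FailureVeto:
    """Cluster historical loss fingerprints; veto buys that land inside a cluster."""

    def __init__(self, death_contexts, n_clusters=50, radius=0.5):
        X = np.asarray(death_contexts)                 # 2,000+ loss fingerprints
        self.km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
        self.radius = radius

    def veto(self, context):
        d = np.linalg.norm(self.km.cluster_centers_ - np.asarray(context), axis=1)
        return bool(d.min() < self.radius)             # True -> block the buy signal

# if FailureVeto(death_contexts).veto(current_context): the buy is blocked,
# regardless of which strategy produced it.
```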
The mechanism was validated by analogy on the Dragon Labyrinth benchmark. In a 14,580-trial ablation study, module M3 ("oscillation killer / anti-repeat") combined with M1 ("belief state") produced a synergistic ×2.5 gain. Invictus implements the same principle at scale: 2,000+ captured death contexts veto buys matching known failure patterns. As one of the AIs put it during code review: "don't walk back into the room you just got hit in."
Invictus has run as ML-Ultimate-V2, with probabilities calibrated by Platt scaling on out-of-sample data, since April 2026. See backend/invictus_ml_ultimate.py.
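Platt scaling itself is a one-dimensional logistic fit; a minimal sketch follows, assuming raw veto scores and held-out outcomes (not the ML-Ultimate-V2 pipeline):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(raw_scores, outcomes):
    """Fit sigma(a*s + b) mapping raw classifier scores to calibrated probabilities."""
    lr = LogisticRegression()
    lr.fit(np.asarray(raw_scores).reshape(-1, 1), np.asarray(outcomes))
    return lambda s: lr.predict_proba(np.asarray(s, dtype=float).reshape(-1, 1))[:, 1]

# calibrate = fit_platt(oos_scores, oos_labels)   # fit on out-of-sample data only
# p_fail = calibrate([0.72])                      # calibrated failure probability
```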
4.5 The Dragon Labyrinth wedge: from board game to trading
The architectural choices above were not invented to trade BTC. They were imported from a parallel project — the Dragon Labyrinth Benchmark [2] — where the ground truth is uncontested and the search space is small enough to permit ablation. Dragon Labyrinth is a reproduction of the 1980 Mattel D&D Computer Labyrinth: an 8×8 grid with an invisible dragon, navigated by an agent that can ping, move, and attack. The TMS1100 processor in the original toy has been playing the game since 1980 on 1.2 KB of ROM. Modern LLMs play it badly.
Over 14,580 fixed-seed trials with seven configurations, we measured an unambiguous result: structure beats compute. The detailed mapping from Dragon Labyrinth's ablation modules to Strategy Arena's research architectures is direct:
| Dragon Labyrinth module | Function | Strategy Arena counterpart |
|---|---|---|
| M1 — Belief state | Posterior over dragon position from past pings | Chimera — posterior over market regime from pattern bank |
| M3 — Oscillation killer | Anti-repeat veto on a recent failed action | Invictus — failure-mode classifier vetoing on known loss contexts |
| Multi-layer agent | Structured composition of belief + planning + acting | PromptForge / AutoResearch — nightly multi-layer prompt evolution |
| Hybrid fusion | Combining MCTS rollouts with structured priors | Leviathan — non-linear fusion of Chimera + Hydra + LLM votes |
The ablation results on the board game generalise as follows: M1 alone hits 11% win rate; M3 alone hits 9%; M1 + M3 combined hits 22% — a ×2.5 synergistic gain over either in isolation. The full structured agent (Oracle-X1, equivalent to Leviathan + Invictus + Chimera composed) reaches 15%, vs. 2% for a pure 300,000-rollout MCTS without belief state. The lesson — that compute thrown at an unstructured search is dominated by a small structured one — is the wedge that justifies the entire architecture of Strategy Arena.
This is not a loose analogy. The transfer is testable: the ablation studies on the board game predicted that Invictus's failure-mode veto would deliver a multiplicative gain on the trading arena, and the live measurements (see /calibration) are consistent with this prediction. The full Dragon Labyrinth dataset (14,580 trial outcomes, three published papers) is available under CC-BY 4.0 at outilsia.fr/dnd-challenge.
5. AutoResearch: the Karpathy loop in production
Every night at 02:30 UTC, an autoresearch cycle runs across 11 autonomous engines (autoresearch_invictus.py, autoresearch_chimera.py, autoresearch_hydra.py, autoresearch_leviathan.py, autoresearch_promptforge.py, autoresearch_collaborative_prompts.py, and others). Each engine implements the four-phase Karpathy loop:
- Accumulate: ingest the day's trades, sentiment data, regime classifications, and outcome feedback into a structured record.
- Think: consult one or more LLMs with a strategy-specific prompt that includes the accumulated data and asks for hypothesis-generation.
- Act: mutate the relevant parameters, prompts, or selection weights based on the LLM's output, subject to a guard rail (no change exceeds ±15% per night).
- Learn: the next day's performance feedback is collected and folded into the next accumulate step.
This is not orchestration or workflow automation — it is a closed-loop self-modification system. The system's prompts at T+30 days are not the prompts at T. The collaborative prompts endpoint at /api/collaborative-prompts exposes the current generation publicly, so the evolution is auditable from outside.
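The act-phase guard rail is the only mechanically simple part of the loop. A minimal sketch, with a hypothetical parameter name, shows how an LLM-proposed change is clamped to ±15% per night:

```python
def clamp_update(old_params, proposed_params, limit=0.15):
    """Apply LLM-proposed parameter changes, clamped to +/-15% of the previous value."""
    out = {}
    for key, old in old_params.items():
        new = proposed_params.get(key, old)
        lo, hi = sorted((old * (1 - limit), old * (1 + limit)))
        out[key] = min(max(new, lo), hi)
    return out

# The LLM proposes a +40% jump; the guard rail caps it at +15%.
print(clamp_update({"rsi_buy_threshold": 30.0}, {"rsi_buy_threshold": 42.0}))
# -> {'rsi_buy_threshold': 34.5}
```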
6. ActiveWiki: replacing RAG with living memory
Retrieval-Augmented Generation (RAG) is the dominant pattern for grounding LLMs in domain knowledge. RAG, however, treats the knowledge base as static: documents are indexed once, retrieved as needed, but the corpus does not change unless a human pipeline updates it.
The ActiveWiki module (backend/activewiki_bridge.py) writes to the knowledge corpus continuously. Every component of Strategy Arena — each strategy, each oracle, each engine — has a persistent memory file under /data/component_memory/ that is updated after every autoresearch cycle. Component memories are human-readable and publicly accessible at /api/component-memory. When an upper-layer LLM is consulted, the component memories of its dependencies are injected into the prompt. The wiki is alive in the sense that it is rewritten faster than it is read.
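A hedged sketch of the read path: the file layout under /data/component_memory/ and the prompt template below are assumptions, not the activewiki_bridge.py implementation.

```python
import json
from pathlib import Path

MEMORY_DIR = Path("/data/component_memory")

def build_prompt(component, dependencies, question):
    """Inject each dependency's persistent memory into the consultation prompt."""
    sections = []
    for dep in dependencies:
        path = MEMORY_DIR / f"{dep}.json"               # hypothetical file naming
        memory = json.loads(path.read_text()) if path.exists() else {}
        sections.append(f"## Memory of {dep}\n{json.dumps(memory, indent=2)}")
    return f"You are advising {component}.\n\n" + "\n\n".join(sections) + f"\n\n{question}"
```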
We described this design choice in a separate post titled "Karpathy: RAG Is Dead. We Built the Living Wiki Alternative". The bridge is currently running in shadow mode (writing in parallel with the legacy manual wiki); after two more weeks of validation, canonical reads will switch to ActiveWiki.
7. Empirical calibration of nine LLMs
A claim of "9 AIs vote" is meaningless unless those AIs' stated confidences are themselves calibrated. We measure this directly. Every hour, each of the nine LLMs (Claude, GPT, Gemini, Grok, DeepSeek, Perplexity, Mistral, Qwen, Llama), plus the Meta and Chimera meta-agents, answers five binary questions about BTC: direction at 4h / 12h / 24h, whether volatility will exceed its recent baseline by 20%, and whether the absolute move will exceed 1%. Each answer carries a stated confidence ∈ [0, 100]. When the horizon resolves, the prediction is graded.
Over 8,718 verified forecasts (excluding NEUTRAL answers), we compute the empirical hit rate per confidence bin and a per-AI Brier score [5, 6]:

Brier = (1/N) Σᵢ (p_yes,i − oᵢ)²

where p_yes is the AI's forecast probability for the YES outcome (confidence/100 if it answered YES, 1 − confidence/100 if NO) and oᵢ ∈ {0, 1} is the realised outcome. Lower Brier scores indicate better calibration.
| AI | Forecasts | Accuracy | Brier ↓ | Killer stat |
|---|---|---|---|---|
| GPT | 732 | 75.0% | 0.209 | says 70% → right 75% (well-calibrated) |
| Table Ronde | 1,085 | 65.2% | 0.218 | says 60% → right 80% (under-confident) |
| Claude | 328 | 80.2% | 0.250 | says 50% → right 80% (under-confident) |
| Meta | 1,028 | 72.8% | 0.250 | says 60% → right 69% (near-perfect) |
| Grok | 682 | 73.8% | 0.250 | says 50% → right 74% (under-confident) |
| Hydra | 1,137 | 61.3% | 0.274 | says 60% → right 75% (under-confident) |
| DeepSeek | 1,460 | 44.5% | 0.301 | says 50% → right 55% |
| Chimera | 1,390 | 41.7% | 0.343 | says 70% → right 37% (over-confident) |
| Gemini | 876 | 35.6% | 0.388 | says 80% → right 34% (badly miscalibrated) |
The full reliability curves are live at /calibration. The dataset is downloadable as CSV at /api/calibration/dataset.csv under CC-BY 4.0, with columns timestamp, provider, question, confidence, predicted, actual, correct.
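The table above can be re-derived from the public CSV. The sketch below assumes the predicted and actual columns are YES/NO strings; the exact encoding may differ, so treat it as a starting point rather than a reference implementation.

```python
import pandas as pd

df = pd.read_csv("https://strategyarena.io/api/calibration/dataset.csv")

# Forecast probability of YES, per the definition above
p_yes = df["confidence"].where(df["predicted"] == "YES",
                               100 - df["confidence"]) / 100.0
outcome = (df["actual"] == "YES").astype(float)

df["brier"] = (p_yes - outcome) ** 2
print(df.groupby("provider")["brier"].mean().sort_values())   # per-AI Brier score
print(df.groupby("provider")["correct"].mean())               # per-AI accuracy
```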
8. Public APIs for reproducibility
Eight JSON endpoints expose the cognitive state of the system to anyone who wants to inspect, audit, or replicate it; all but the last require no authentication:
- /api/calibration — full reliability curves and Brier scores per AI
- /api/calibration/dataset.csv — every verified forecast (CC-BY 4.0)
- /api/digest/weekly — weekly performance digest (top/bottom strategies, totals)
- /api/component-memory — per-component persistent memory contents
- /api/collaborative-prompts — current evolved prompts (changes nightly)
- /api/comparison — full arena leaderboard with all metrics
- /api/widget/consensus — current BTC consensus signal (embeddable)
- /api/v1/oracle — authenticated 9-AI vote on any user-specified question
Combined with the open dataset from the Dragon Labyrinth benchmark [2] (14,580 trials, fixed seeds, CC-BY 4.0), this constitutes a reproducibility surface that, to our knowledge, no comparable commercial trading platform publishes.
9. Open questions
What we have built is incomplete in ways we are working on:
- Causal validity of calibration improvements. If we observe Brier scores trending down over time, is the system truly improving or are we overfitting to a recent regime? We need walk-forward partitioned analyses, which we have not yet shipped publicly.
- Independence of the 9 LLMs. Several providers use overlapping training data; correlations between their errors may be non-trivial. A factor analysis of forecast residuals is queued.
- Invictus generalisation. The failure-mode classifier was trained on past trades. In a regime shift, the death contexts may no longer cluster the same way. We are studying decay schedules for the failure memory.
- Cost of inference. The nightly autoresearch cycle calls many LLM APIs. Bootstrapping requires this cost to be small; scaling does not. We are exploring distillation of the autoresearch prompts into smaller local models.
- The DSL. The canonical grammar is exposed at /arena-script with nine keywords (including strategy, entry:, exit:, invictus_protection:, chimera_filter:, and regime_filter:) and six runnable presets (RSI Momentum, EMA Trend, Safe Portfolio, DEGEN, Chimera Hunter, Full Stack Brain). The natural-language compiler at /forge converts free-form descriptions into ArenaScript. The grammar is not yet formally specified or versioned; we expect to release a normative document once the syntax stabilises.
10. Reaching us
If you have technical comments, replication attempts, contradictory measurements, or use cases (enterprise calibration audits, multi-LLM consensus systems, AI evaluation infrastructure), the right channel is email to [email protected]. The site is live at strategyarena.io. The codebase is partially open (calibration tooling and the Dragon Labyrinth dataset); broader open-sourcing is on the roadmap.
Strategy Arena was built solo by Chris Lacombe between January and May 2026. Several strategy architectures were co-designed with frontier LLMs (Anthropic's Claude, xAI's Grok, DeepSeek), and credits are attached in the source. The site is bootstrapped and accepts no external funding at this time.
10.5 Verification recipes
Every claim in this document can be verified in one command. The following are designed to be runnable from any terminal, requiring only curl and (optionally) jq.
# 1. Inspect live calibration data (9 LLMs, all forecast bins)
curl -s https://strategyarena.io/api/calibration | jq '.providers | to_entries | map({name: .key, brier: .value.brier_score, acc: .value.accuracy_pct})'
# 2. Download the full calibration dataset (CC-BY 4.0, ~720 KB CSV)
curl -O https://strategyarena.io/api/calibration/dataset.csv
# 3. Read the current state of each component's persistent memory
curl -s https://strategyarena.io/api/component-memory | jq '.components | keys'
# 4. Inspect the current generation of each oracle's prompt (mutates nightly)
curl -s https://strategyarena.io/api/collaborative-prompts | jq '.prompts | keys'
# 5. Watch the prompts change over a few days (run, sleep, diff)
curl -s https://strategyarena.io/api/collaborative-prompts > /tmp/sa-prompts-T0.json
# ... wait 48h ...
curl -s https://strategyarena.io/api/collaborative-prompts > /tmp/sa-prompts-T1.json
diff /tmp/sa-prompts-T0.json /tmp/sa-prompts-T1.json
# 6. Pull the full arena leaderboard (60+ strategies, all metrics)
curl -s https://strategyarena.io/api/comparison | jq '.ranking[0:5]'
# 7. Subscribe to the public consensus widget (no auth)
curl -s https://strategyarena.io/api/widget/consensus | jq
If any of these commands return empty or unexpected results, the claim is broken — please email [email protected] with the failing command and we will publish a correction.
References
1. Karpathy, A. (2023–2025). "State of GPT" and subsequent talks on Software 3.0 and agentic systems. YouTube.
2. Lacombe, C. (2026). "Dragon Labyrinth Benchmark: 14,580 trials of TMS1100 (1980) vs. modern AI agents." outilsia.fr/dnd-challenge (CC-BY 4.0).
3. Nielsen, M. A. & Chuang, I. L. (2010). Quantum Computation and Quantum Information. Cambridge University Press. Standard reference for quantum gate operators (Hadamard, CNOT).
4. Achiam, J., Held, D., Tamar, A., & Abbeel, P. (2017). "Constrained Policy Optimization." ICML. Foundational treatment of safe exploration in reinforcement learning.
5. Brier, G. W. (1950). "Verification of forecasts expressed in terms of probability." Monthly Weather Review, 78(1), 1–3. Original definition of the Brier score.
6. Murphy, A. H. (1973). "A new vector partition of the probability score." Journal of Applied Meteorology, 12(4), 595–600. Decomposition of the Brier score into reliability and resolution.
7. Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). "On Calibration of Modern Neural Networks." ICML. Establishes temperature scaling as a calibration tool for deep networks.