A Software 3.0 Cognitive System Applied to Trading
1. Motivation: Software 3.0 needs a testbed
In State of GPT (2023) and subsequent talks, Andrej Karpathy proposed that large language models constitute a third programming paradigm — alongside Software 1.0 (hand-written code) and Software 2.0 (learned weights). In Software 3.0, programs are written in natural language, executed by LLMs, and composed into agentic loops that accumulate, think, act, and learn [1]. Trading is an unusually clean testbed for this idea, because outcomes resolve in hours, capital is virtual (paper trading carries zero counterparty risk), and the ground truth — did the price go up or not — admits no interpretation.
We built Strategy Arena over four months as a solo bootstrapped project to test how far Software 3.0 ideas can be pushed in a domain that punishes mistakes. The result is not a trading bot. It is a cognitive system with its own domain-specific language, its own institutional memory, its own immune system against repeating failures, and an audit surface that exposes every internal state through public APIs.
This document is the first written description of what was built. The site has been positioned publicly as a cryptocurrency AI trading platform, but that framing under-describes the artifact. The intended audience here is researchers, engineers, and observers who work on multi-LLM systems, calibration, and agentic loops, not the retail crypto trader.
2. Architecture overview
The system has four levels stacked above the trading arena.
Decisions flow upward (arena strategies emit signals, meta layers aggregate). Memory flows downward (AutoResearch and ActiveWiki rewrite the prompts, parameters, and selection weights of the layers below). Invictus operates as a veto in parallel: any buy signal that matches a previously clustered failure context is blocked, regardless of how confident the upper layers were.
3. The six strategy architectures
Most of the 60+ strategies in the arena are conventional: buy-and-hold, DCA, RSI mean-reversion, Donchian channels, etc. Six are unusual enough to warrant individual description, because each implements a distinct scientific thesis.
3.1 Chimera — micro-models retrieval
Chimera maintains a bank of 1,221 distinct market patterns extracted from historical data, each indexed semantically against the performance of the 1,196 strategies that previously matched it. At inference time, the current price action is converted into a feature vector and compared by cosine similarity against the entire bank.
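The retrieval step reduces to a nearest-pattern lookup. The sketch below is illustrative only: the (1221, d) pattern matrix, the parallel list of best-strategy ids, and the feature extraction are stand-ins for the real indexing pipeline in backend/chimera/.

```python
import numpy as np

def best_match(current_features, pattern_bank, best_strategy_ids):
    """Cosine similarity of the current feature vector against all 1,221 patterns.

    pattern_bank: (1221, d) array, one row per indexed market pattern.
    best_strategy_ids: the strategy that historically performed best on each pattern.
    """
    x = current_features / np.linalg.norm(current_features)
    bank = pattern_bank / np.linalg.norm(pattern_bank, axis=1, keepdims=True)
    sims = bank @ x                        # cosine similarity against the entire bank
    idx = int(np.argmax(sims))
    return best_strategy_ids[idx], float(sims[idx])

# strategy_id, score = best_match(features_now, bank, winners)
# -> the "chameleon" signal promoted for the current tick
```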
The matched pattern's historical winning strategy is then promoted as the "chameleon" signal for the current tick. Chimera contains 51 million indexed pattern occurrences across the historical archive, occupying ~59 MB. This is consistent with Karpathy's micro-models advocacy: rather than one large model attempting to predict, 1,196 small specialised models are retrieved by similarity. See backend/chimera/ in the source for the indexing pipeline.
3.2 Leviathan — non-linear fusion
Leviathan ingests Chimera's pattern signal, Hydra's meta-learner output, the votes of six LLMs, news sentiment, and a contrarian filter, then produces a single BUY/SELL/HOLD decision via a learned non-linear function of these inputs.
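Leviathan's actual fusion formula is not reproduced here. The sketch below only illustrates the shape of the idea, a non-linear combination of the named inputs, with made-up weights and an arbitrary interaction term:

```python
import numpy as np

def fuse(chimera, hydra, llm_votes, sentiment, contrarian,
         weights=(0.30, 0.25, 0.20, 0.15, 0.10)):
    """All inputs normalised to [-1, +1]; llm_votes is a list of six LLM votes."""
    x = np.array([chimera, hydra, np.mean(llm_votes), sentiment, contrarian])
    score = np.tanh(np.dot(weights, x) + 0.5 * chimera * hydra)  # toy non-linearity
    if score > 0.2:
        return "BUY"
    if score < -0.2:
        return "SELL"
    return "HOLD"

# fuse(0.6, 0.4, [1, 1, -1, 1, 0, 1], 0.2, -0.1)  -> "BUY"
```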
Leviathan was validated indirectly on the Dragon Labyrinth Benchmark [2], a published reproduction of the 1980 Mattel D&D Computer Labyrinth board game played by both modern AIs and the original 4-bit TMS1100 processor. In a 14,580-trial grid search, a pure brute-force MCTS plateaued at 2% win rate; a structured multi-layer agent (Oracle-X1) reached 15%, a 7.5× outperformance. Leviathan applies the same lesson: structure beats compute.
3.3 Hydra — XGBoost meta-learner
Hydra trains an XGBoost classifier on the historical trade outcomes of every strategy in the arena. At each tick, it asks: "if each strategy were to trade now, would it be profitable?" The 16 input features include RSI, ATR%, EMA distances and slopes, multi-window rate-of-change, Bollinger width, volatility, and consecutive direction counters. Output is aggregated to a consensus BUY/SELL signal when at least 45% of strategies are predicted profitable above 50% confidence. Hydra operates in shadow mode (trades the arena, hidden from public APIs) until its accuracy is independently validated.
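A hedged sketch of the meta-learner follows, assuming a flat table of per-(strategy, tick) feature rows labelled by whether that strategy's trade would have been profitable. The feature list and model settings are illustrative; only the 45% / 50% thresholds come from the text.

```python
import numpy as np
from xgboost import XGBClassifier

def train_meta_learner(X, y):
    """X: per-(strategy, tick) feature rows (RSI, ATR%, EMA distances, ...);
    y: 1 if that strategy's trade at that tick was profitable, else 0."""
    model = XGBClassifier(n_estimators=200, max_depth=4)
    model.fit(X, y)
    return model

def consensus_signal(model, rows_now, min_frac=0.45, min_conf=0.50):
    """BUY when at least 45% of strategies are predicted profitable above 50% confidence."""
    p_profitable = model.predict_proba(np.asarray(rows_now))[:, 1]
    frac = np.mean(p_profitable > min_conf)
    return "BUY" if frac >= min_frac else "HOLD"
```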
3.4 QuantumCollapse — quantum-inspired circuit
QuantumCollapse models the trading decision as a 4-qubit quantum circuit, where each qubit represents a market dimension: momentum (qubit 1), volume (qubit 2), regime (qubit 3), structure (qubit 4). Hadamard gates create superposition over each dimension, CNOT gates entangle them, and the probability of a LONG outcome is extracted deterministically by reading the squared amplitudes |ψ|² rather than by random sampling. This makes the strategy fully reproducible. The implementation is in numpy (no quantum hardware involved) and follows standard quantum-inspired computing conventions [3]. The concept originated from xAI's Grok; the deterministic-measurement port and arena adaptation are credited to Anthropic's Claude in backend/strategies/vault/current/quantum_collapse_strategy.py.
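For concreteness, here is a self-contained numpy sketch of such a circuit. It is not the production code in quantum_collapse_strategy.py: the RY feature encoding, the CNOT chain topology, and the majority-of-qubits readout are our assumptions; only the Hadamard / CNOT / deterministic-|ψ|² structure comes from the description above.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)   # Hadamard
I2 = np.eye(2)

def ry(theta):
    """RY rotation used here to encode a market feature as an amplitude."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def embed(gate, qubit, n=4):
    """Lift a single-qubit gate acting on `qubit` into the n-qubit space."""
    mats = [gate if i == qubit else I2 for i in range(n)]
    out = mats[0]
    for m in mats[1:]:
        out = np.kron(out, m)
    return out

def cnot(control, target, n=4):
    """Permutation matrix of a CNOT between two qubits of an n-qubit register."""
    dim = 2 ** n
    U = np.zeros((dim, dim))
    for basis in range(dim):
        bits = [(basis >> (n - 1 - i)) & 1 for i in range(n)]
        if bits[control]:
            bits[target] ^= 1
        U[sum(b << (n - 1 - i) for i, b in enumerate(bits)), basis] = 1.0
    return U

def p_long(features):
    """features in [0, 1]: momentum, volume, regime, structure."""
    n = 4
    psi = np.zeros(2 ** n)
    psi[0] = 1.0                                         # |0000>
    for q in range(n):                                   # superposition per dimension
        psi = embed(H, q, n) @ psi
    for q, f in enumerate(features):                     # feature-dependent rotations
        psi = embed(ry(np.pi * f), q, n) @ psi
    for q in range(n - 1):                               # entangle adjacent dimensions
        psi = cnot(q, q + 1, n) @ psi
    probs = np.abs(psi) ** 2                             # deterministic |psi|^2 readout
    return sum(p for b, p in enumerate(probs) if bin(b).count("1") >= 3)

print(round(p_long([0.8, 0.6, 0.7, 0.5]), 3))            # P(LONG), fully reproducible
```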
3.5 MomentumDiffusion — physics-inspired PDE
MomentumDiffusion treats momentum across timeframes as a one-dimensional diffusion field and solves the heat equation

∂u/∂t = D ∂²u/∂x²
where u(x, t) is momentum at log-spaced timeframe x ∈ {15min, 1h, 4h, 12h, 1D} at time t, and D is an adaptive diffusion coefficient based on realised volatility. The strategy enters when the diffusion front accelerates toward longer timeframes (a momentum "wave" forming) and volume confirms. Concept by Grok; numerical PDE implementation by Claude.
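A minimal numerical sketch follows, assuming an explicit finite-difference step over the five timeframe nodes; the grid spacing, time step, and entry heuristic are assumptions, not the production solver.

```python
import numpy as np

TIMEFRAMES = ["15min", "1h", "4h", "12h", "1D"]   # log-spaced positions x

def diffuse(u, D, dt=0.1, dx=1.0, steps=10):
    """Explicit Euler step of u_t = D * u_xx (stable while D*dt/dx**2 <= 0.5).

    u: momentum measured on each timeframe; D: adaptive diffusion coefficient
    derived from realised volatility."""
    u = np.asarray(u, dtype=float).copy()
    for _ in range(steps):
        lap = np.zeros_like(u)
        lap[1:-1] = u[2:] - 2 * u[1:-1] + u[:-2]   # discrete second derivative
        u += D * dt / dx ** 2 * lap
    return u

# Entry heuristic (assumption): a momentum "wave" is forming when the diffused
# field grows fastest at the long-timeframe end and volume confirms.
print(diffuse([1.2, 0.8, 0.3, 0.1, 0.0], D=0.4))
```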
3.6 DebateForge — genetic multi-agent debate
DebateForge runs five internal agents that debate every tick: Whale Hunter (volume + CVD), Vol Breakout (true ATR), Techno Predator (adaptive bands), Mean Reversion Monk (z-score multi-window), and Regime Oracle (ADX + volume regime). Agents vote weighted by their recent performance. Each night, the worst-performing agent's parameters are crossed genetically with the best-performing agent's, and mutated stochastically. The strategy is the first in the ecosystem co-designed by three LLMs (Grok + DeepSeek + Claude). See backend/strategies/vault/current/debate_forge_strategy.py.
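A hedged sketch of the two mechanisms named above, performance-weighted voting and nightly crossover of the worst agent with the best; the agent parameter layout and mutation scale are assumptions, not the production code.

```python
import random

def weighted_vote(votes, perf):
    """votes: {agent: +1 (buy) / -1 (sell) / 0}; perf: recent performance weights."""
    total = sum(max(perf[a], 0.0) * v for a, v in votes.items())
    return "BUY" if total > 0 else "SELL" if total < 0 else "HOLD"

def nightly_evolution(params, perf, mutation=0.10):
    """Cross the worst agent's parameters with the best agent's, then mutate."""
    best, worst = max(perf, key=perf.get), min(perf, key=perf.get)
    child = {}
    for key in params[worst]:
        parent = params[best] if random.random() < 0.5 else params[worst]
        child[key] = parent[key] * (1 + random.uniform(-mutation, mutation))
    params[worst] = child
    return params
```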
4. Invictus: a supervised failure-mode classifier
Risk-aware reinforcement learning literature has long argued that avoiding catastrophic outcomes is qualitatively different from maximising expected return [4]. Invictus operationalises this idea outside the RL framework. It collects "death contexts" — fingerprints of every losing trade across all strategies, encoding regime, indicator state, and recent price action — and clusters them. At inference, any incoming buy signal whose context matches a known failure cluster is vetoed, regardless of which strategy produced it.
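A minimal sketch of the veto, assuming death contexts are fixed-length numeric fingerprints; the clustering method (KMeans here) and the distance threshold are assumptions standing in for the production classifier.

```python
import numpy as np
from sklearn.cluster import KMeans

class FailureVeto:
    """Cluster historical loss fingerprints; veto buys that land inside a cluster."""

    def __init__(self, death_contexts, n_clusters=50, radius=0.5):
        X = np.asarray(death_contexts)                 # 2,000+ loss fingerprints
        self.km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
        self.radius = radius

    def veto(self, context):
        d = np.linalg.norm(self.km.cluster_centers_ - np.asarray(context), axis=1)
        return bool(d.min() < self.radius)             # True -> block the buy signal

# if FailureVeto(death_contexts).veto(current_context): the buy is blocked,
# regardless of which strategy produced it.
```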
The mechanism was validated by analogy on the Dragon Labyrinth benchmark. In a 14,580-trial ablation study, module M3 ("oscillation killer / anti-repeat") combined with M1 ("belief state") produced a synergistic ×2.5 gain. Invictus implements the same principle at scale: 2,000+ captured death contexts veto buys matching known failure patterns. As one of the AIs put it during code review: "don't walk back into the room you just got hit in."
Invictus has run as ML-Ultimate-V2, with probabilities calibrated by Platt scaling on out-of-sample data, since April 2026. See backend/invictus_ml_ultimate.py.
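Platt scaling itself is a one-dimensional logistic fit; a minimal sketch follows, assuming raw veto scores and held-out outcomes (not the ML-Ultimate-V2 pipeline):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(raw_scores, outcomes):
    """Fit sigma(a*s + b) mapping raw classifier scores to calibrated probabilities."""
    lr = LogisticRegression()
    lr.fit(np.asarray(raw_scores).reshape(-1, 1), np.asarray(outcomes))
    return lambda s: lr.predict_proba(np.asarray(s, dtype=float).reshape(-1, 1))[:, 1]

# calibrate = fit_platt(oos_scores, oos_labels)   # fit on out-of-sample data only
# p_fail = calibrate([0.72])                      # calibrated failure probability
```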
4.5 The Dragon Labyrinth wedge: from board game to trading
The architectural choices above were not invented to trade BTC. They were imported from a parallel project — the Dragon Labyrinth Benchmark [2] — where the ground truth is uncontested and the search space is small enough to permit ablation. Dragon Labyrinth is a reproduction of the 1980 Mattel D&D Computer Labyrinth: an 8×8 grid with an invisible dragon, navigated by an agent that can ping, move, and attack. The TMS1100 processor in the original toy has been playing the game since 1980 on 1.2 KB of ROM. Modern LLMs play it badly.
Over 14,580 fixed-seed trials with seven configurations, we measured an unambiguous result: structure beats compute. The detailed mapping from Dragon Labyrinth's ablation modules to Strategy Arena's research architectures is direct:
| Dragon Labyrinth module | Function | Strategy Arena counterpart |
|---|---|---|
| M1 — Belief state | Posterior over dragon position from past pings | Chimera — posterior over market regime from pattern bank |
| M3 — Oscillation killer | Anti-repeat veto on a recent failed action | Invictus — failure-mode classifier vetoing on known loss contexts |
| Multi-layer agent | Structured composition of belief + planning + acting | PromptForge / AutoResearch — nightly multi-layer prompt evolution |
| Hybrid fusion | Combining MCTS rollouts with structured priors | Leviathan — non-linear fusion of Chimera + Hydra + LLM votes |
The ablation results on the board game generalise as follows: M1 alone hits 11% win rate; M3 alone hits 9%; M1 + M3 combined hits 22% — a ×2.5 synergistic gain over either in isolation. The full structured agent (Oracle-X1, equivalent to Leviathan + Invictus + Chimera composed) reaches 15%, vs. 2% for a pure 300,000-rollout MCTS without belief state. The lesson — that compute thrown at an unstructured search is dominated by a small structured one — is the wedge that justifies the entire architecture of Strategy Arena.
This is not a loose analogy. The transfer is testable: the ablation studies on the board game predicted that Invictus's failure-mode veto would deliver a multiplicative gain on the trading arena, and the live measurements (see /calibration) are consistent with this prediction. The full Dragon Labyrinth dataset (14,580 trial outcomes, three published papers) is available under CC-BY 4.0 at outilsia.fr/dnd-challenge.
5. AutoResearch: the Karpathy loop in production
Every night at 02:30 UTC, an autoresearch cycle runs across 11 autonomous engines (autoresearch_invictus.py, autoresearch_chimera.py, autoresearch_hydra.py, autoresearch_leviathan.py, autoresearch_promptforge.py, autoresearch_collaborative_prompts.py, and others). Each engine implements the four-phase Karpathy loop:
- Accumulate: ingest the day's trades, sentiment data, regime classifications, and outcome feedback into a structured record.
- Think: consult one or more LLMs with a strategy-specific prompt that includes the accumulated data and asks for hypothesis-generation.
- Act: mutate the relevant parameters, prompts, or selection weights based on the LLM's output, subject to a guard rail (no change exceeds ±15% per night).
- Learn: the next day's performance feedback is collected and folded into the next accumulate step.
This is not orchestration or workflow automation — it is a closed-loop self-modification system. The system's prompts at T+30 days are not the prompts at T. The collaborative prompts endpoint at /api/collaborative-prompts exposes the current generation publicly, so the evolution is auditable from outside.
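The act-phase guard rail is the only mechanically simple part of the loop. A minimal sketch, with a hypothetical parameter name, shows how an LLM-proposed change is clamped to ±15% per night:

```python
def clamp_update(old_params, proposed_params, limit=0.15):
    """Apply LLM-proposed parameter changes, clamped to +/-15% of the previous value."""
    out = {}
    for key, old in old_params.items():
        new = proposed_params.get(key, old)
        lo, hi = sorted((old * (1 - limit), old * (1 + limit)))
        out[key] = min(max(new, lo), hi)
    return out

# The LLM proposes a +40% jump; the guard rail caps it at +15%.
print(clamp_update({"rsi_buy_threshold": 30.0}, {"rsi_buy_threshold": 42.0}))
# -> {'rsi_buy_threshold': 34.5}
```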
6. ActiveWiki: replacing RAG with living memory
Retrieval-Augmented Generation (RAG) is the dominant pattern for grounding LLMs in domain knowledge. RAG, however, treats the knowledge base as static: documents are indexed once, retrieved as needed, but the corpus does not change unless a human pipeline updates it.
The ActiveWiki module (backend/activewiki_bridge.py) writes to the knowledge corpus continuously. Every component of Strategy Arena — each strategy, each oracle, each engine — has a persistent memory file under /data/component_memory/ that is updated after every autoresearch cycle. Component memories are human-readable and publicly accessible at /api/component-memory. When an upper-layer LLM is consulted, the component memories of its dependencies are injected into the prompt. The wiki is alive in the sense that it is rewritten faster than it is read.
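A hedged sketch of the read path: the file layout under /data/component_memory/ and the prompt template below are assumptions, not the activewiki_bridge.py implementation.

```python
import json
from pathlib import Path

MEMORY_DIR = Path("/data/component_memory")

def build_prompt(component, dependencies, question):
    """Inject each dependency's persistent memory into the consultation prompt."""
    sections = []
    for dep in dependencies:
        path = MEMORY_DIR / f"{dep}.json"               # hypothetical file naming
        memory = json.loads(path.read_text()) if path.exists() else {}
        sections.append(f"## Memory of {dep}\n{json.dumps(memory, indent=2)}")
    return f"You are advising {component}.\n\n" + "\n\n".join(sections) + f"\n\n{question}"
```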
We described this design choice in a separate post titled "Karpathy: RAG Is Dead. We Built the Living Wiki Alternative". The bridge is currently running in shadow mode (writing in parallel with the legacy manual wiki); after two more weeks of validation, canonical reads will switch to ActiveWiki.
7. Empirical calibration of nine LLMs
A claim of "9 AIs vote" is meaningless unless those AIs' stated confidences are themselves calibrated. We measure this directly. Every hour, each of the nine LLMs (Claude, GPT, Gemini, Grok, DeepSeek, Perplexity, Mistral, Qwen, Llama), plus the Meta and Chimera meta-agents, answers five binary questions about BTC: direction at 4h / 12h / 24h, whether volatility will exceed its recent baseline by 20%, and whether the absolute move will exceed 1%. Each answer carries a stated confidence ∈ [0, 100]. When the horizon resolves, the prediction is graded.
Over 8,718 verified forecasts (excluding NEUTRAL answers), we compute the empirical hit rate per confidence bin and a per-AI Brier score [5, 6]:

Brier = (1/N) Σᵢ (p_yes,i − oᵢ)²

where p_yes is the AI's forecast probability for the YES outcome (confidence/100 if it answered YES, 1 − confidence/100 if NO) and oᵢ ∈ {0, 1} is the realised outcome. Lower Brier scores indicate better calibration.
| AI | Forecasts | Accuracy | Brier ↓ | Killer stat |
|---|---|---|---|---|
| GPT | 732 | 75.0% | 0.209 | says 70% → right 75% (well-calibrated) |
| Table Ronde | 1,085 | 65.2% | 0.218 | says 60% → right 80% (under-confident) |
| Claude | 328 | 80.2% | 0.250 | says 50% → right 80% (under-confident) |
| Meta | 1,028 | 72.8% | 0.250 | says 60% → right 69% (near-perfect) |
| Grok | 682 | 73.8% | 0.250 | says 50% → right 74% (under-confident) |
| Hydra | 1,137 | 61.3% | 0.274 | says 60% → right 75% (under-confident) |
| DeepSeek | 1,460 | 44.5% | 0.301 | says 50% → right 55% |
| Chimera | 1,390 | 41.7% | 0.343 | says 70% → right 37% (over-confident) |
| Gemini | 876 | 35.6% | 0.388 | says 80% → right 34% (badly miscalibrated) |
The full reliability curves are live at /calibration. The dataset is downloadable as CSV at /api/calibration/dataset.csv under CC-BY 4.0, with columns timestamp, provider, question, confidence, predicted, actual, correct.
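The table above can be re-derived from the public CSV. The sketch below assumes the predicted and actual columns are YES/NO strings; the exact encoding may differ, so treat it as a starting point rather than a reference implementation.

```python
import pandas as pd

df = pd.read_csv("https://strategyarena.io/api/calibration/dataset.csv")

# Forecast probability of YES, per the definition above
p_yes = df["confidence"].where(df["predicted"] == "YES",
                               100 - df["confidence"]) / 100.0
outcome = (df["actual"] == "YES").astype(float)

df["brier"] = (p_yes - outcome) ** 2
print(df.groupby("provider")["brier"].mean().sort_values())   # per-AI Brier score
print(df.groupby("provider")["correct"].mean())               # per-AI accuracy
```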
8. Public APIs for reproducibility
Eight JSON endpoints expose the cognitive state of the system to anyone who wants to inspect, audit, or replicate it; all but the last require no authentication:
- /api/calibration — full reliability curves and Brier scores per AI
- /api/calibration/dataset.csv — every verified forecast (CC-BY 4.0)
- /api/digest/weekly — weekly performance digest (top/bottom strategies, totals)
- /api/component-memory — per-component persistent memory contents
- /api/collaborative-prompts — current evolved prompts (changes nightly)
- /api/comparison — full arena leaderboard with all metrics
- /api/widget/consensus — current BTC consensus signal (embeddable)
- /api/v1/oracle — authenticated 9-AI vote on any user-specified question
Combined with the open dataset from the Dragon Labyrinth benchmark [2] (14,580 trials, fixed seeds, CC-BY 4.0), this constitutes a reproducibility surface that, to our knowledge, no comparable commercial trading platform publishes.
9. Open questions
What we have built is incomplete in ways we are working on:
- Causal validity of calibration improvements. If we observe Brier scores trending down over time, is the system truly improving or are we overfitting to a recent regime? We need walk-forward partitioned analyses, which we have not yet shipped publicly.
- Independence of the 9 LLMs. Several providers use overlapping training data; correlations between their errors may be non-trivial. A factor analysis of forecast residuals is queued.
- Invictus generalisation. The failure-mode classifier was trained on past trades. In a regime shift, the death contexts may no longer cluster the same way. We are studying decay schedules for the failure memory.
- Cost of inference. The nightly autoresearch cycle calls many LLM APIs. Bootstrapping requires this cost to be small; scaling does not. We are exploring distillation of the autoresearch prompts into smaller local models.
- The DSL. The canonical grammar is exposed at /arena-script with nine keywords (including strategy, entry:, exit:, invictus_protection:, chimera_filter:, and regime_filter:) and six runnable presets (RSI Momentum, EMA Trend, Safe Portfolio, DEGEN, Chimera Hunter, Full Stack Brain). The natural-language compiler at /forge converts free-form descriptions into ArenaScript. The grammar is not yet formally specified or versioned; we expect to release a normative document once the syntax stabilises.
10. Reaching us
If you have technical comments, replication attempts, contradictory measurements, or use cases (enterprise calibration audits, multi-LLM consensus systems, AI evaluation infrastructure), the right channel is email to [email protected]. The site is live at strategyarena.io. The codebase is partially open (calibration tooling and the Dragon Labyrinth dataset); broader open-sourcing is on the roadmap.
Strategy Arena was built solo by Chris Lacombe between January and May 2026. Several strategy architectures were co-designed with frontier LLMs (Anthropic's Claude, xAI's Grok, DeepSeek), and credits are attached in the source. The site is bootstrapped and accepts no external funding at this time.
10.5 Verification recipes
Every claim in this document can be verified in one command. The following are designed to be runnable from any terminal, requiring only curl and (optionally) jq.
# 1. Inspect live calibration data (9 LLMs, all forecast bins)
curl -s https://strategyarena.io/api/calibration | jq '.providers | to_entries | map({name: .key, brier: .value.brier_score, acc: .value.accuracy_pct})'
# 2. Download the full calibration dataset (CC-BY 4.0, ~720 KB CSV)
curl -O https://strategyarena.io/api/calibration/dataset.csv
# 3. Read the current state of each component's persistent memory
curl -s https://strategyarena.io/api/component-memory | jq '.components | keys'
# 4. Inspect the current generation of each oracle's prompt (mutates nightly)
curl -s https://strategyarena.io/api/collaborative-prompts | jq '.prompts | keys'
# 5. Watch the prompts change over a few days (run, sleep, diff)
curl -s https://strategyarena.io/api/collaborative-prompts > /tmp/sa-prompts-T0.json
# ... wait 48h ...
curl -s https://strategyarena.io/api/collaborative-prompts > /tmp/sa-prompts-T1.json
diff /tmp/sa-prompts-T0.json /tmp/sa-prompts-T1.json
# 6. Pull the full arena leaderboard (60+ strategies, all metrics)
curl -s https://strategyarena.io/api/comparison | jq '.ranking[0:5]'
# 7. Subscribe to the public consensus widget (no auth)
curl -s https://strategyarena.io/api/widget/consensus | jq
If any of these commands return empty or unexpected results, the claim is broken — please email [email protected] with the failing command and we will publish a correction.
References
1. Karpathy, A. (2023–2025). "State of GPT" and subsequent talks on Software 3.0 and agentic systems. YouTube.
2. Lacombe, C. (2026). "Dragon Labyrinth Benchmark: 14,580 trials of TMS1100 (1980) vs. modern AI agents." outilsia.fr/dnd-challenge (CC-BY 4.0).
3. Nielsen, M. A. & Chuang, I. L. (2010). Quantum Computation and Quantum Information. Cambridge University Press. Standard reference for quantum gate operators (Hadamard, CNOT).
4. Achiam, J., Held, D., Tamar, A., & Abbeel, P. (2017). "Constrained Policy Optimization." ICML. Foundational treatment of safe exploration in reinforcement learning.
5. Brier, G. W. (1950). "Verification of forecasts expressed in terms of probability." Monthly Weather Review, 78(1), 1–3. Original definition of the Brier score.
6. Murphy, A. H. (1973). "A new vector partition of the probability score." Journal of Applied Meteorology, 12(4), 595–600. Decomposition of the Brier score into reliability and resolution.
7. Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). "On Calibration of Modern Neural Networks." ICML. Establishes temperature scaling as a calibration tool for deep networks.