Does the KTD-Fin Benchmark Reveal LLMs' True Trading Ability?

2026-05-28 arXiv q-fin.TR Validation confidence 0.84
Original source: From Knowing to Doing: A Memory-Controlled Benchmark for LLM Trading Agents on Stock Markets
Strategy Arena finding: Portfolio Sharpe 2.07 with Monte Carlo cell composition tracking

A new research paper on arXiv (q-fin.TR) raises a critical question for anyone using LLM agents in trading: do these models truly know how to invest, or are they merely regurgitating memorized data?

The study, titled "From Knowing to Doing: A Memory-Controlled Benchmark for LLM Trading Agents on Stock Markets," identifies two major flaws in current evaluations. First, long backtests often cover periods that precede the knowledge cutoffs of frontier LLMs, so agents can recall past prices and events rather than reason about them. Second, raw returns are a noisy proxy for skill: positive performance may come from market beta or a favorable regime, not genuine alpha.
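To make the second flaw concrete, here is a minimal sketch of the standard single-factor decomposition, in which agent returns are regressed on market returns to separate alpha (the intercept) from beta (the slope). This is a textbook illustration, not the paper's evaluation code; the function name `alpha_beta` and the return series are hypothetical.

```python
from statistics import mean


def alpha_beta(agent_returns, market_returns):
    """Ordinary least squares fit of r_agent = alpha + beta * r_market + noise."""
    if len(agent_returns) != len(market_returns) or len(agent_returns) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    m_bar = mean(market_returns)
    a_bar = mean(agent_returns)
    cov = sum((m - m_bar) * (a - a_bar)
              for m, a in zip(market_returns, agent_returns))
    var = sum((m - m_bar) ** 2 for m in market_returns)
    if var == 0:
        raise ValueError("market returns have zero variance")
    beta = cov / var
    alpha = a_bar - beta * m_bar
    return alpha, beta


# Hypothetical data: an agent that simply holds 1.5x the market.
market = [0.010, -0.004, 0.007, 0.012, -0.002]
agent = [1.5 * r for r in market]

alpha, beta = alpha_beta(agent, market)
print(f"mean return={mean(agent):.4f} alpha={alpha:.4f} beta={beta:.2f}")
```

The toy agent earns a positive average return in a rising market, yet the regression attributes all of it to beta and none to alpha, which is exactly the case raw returns cannot distinguish from skill.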

To address this, the authors propose KTD-Fin (Knowing-To-Doing Financial Benchmark), which anonymizes key identifiers (tickers, dates, prices) via a masking protocol. The idea is simple: if the agent cannot recognize the stock or the period, it cannot lean on memorized outcomes and must reason from the market dynamics it is shown.
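The article does not describe the masking protocol in detail, so the following is only a rough sketch of what anonymizing tickers, dates, and prices could look like. Every choice here is an assumption for illustration: the `Bar` record, the `mask_series` helper, the opaque alias, the relative day index, and the rebasing of prices to 100.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bar:
    ticker: str
    date: str     # ISO date string, e.g. "2024-03-01"
    close: float


def mask_series(bars, alias="ASSET_A"):
    """Return anonymized rows: opaque alias, relative day index, rebased price.

    Hypothetical scheme, not the KTD-Fin protocol itself.
    """
    if not bars:
        return []
    ordered = sorted(bars, key=lambda b: b.date)  # ISO dates sort chronologically
    base = ordered[0].close
    if base <= 0:
        raise ValueError("first close must be positive to rebase prices")
    return [
        {"asset": alias, "t": i, "price": round(100.0 * b.close / base, 4)}
        for i, b in enumerate(ordered)
    ]


# Hypothetical input bars for a made-up ticker.
bars = [
    Bar("XYZ", "2024-03-04", 52.0),
    Bar("XYZ", "2024-03-01", 50.0),
    Bar("XYZ", "2024-03-05", 49.0),
]
masked = mask_series(bars)
for row in masked:
    print(row)
```

Rebasing keeps relative price moves intact, so the agent still sees the same percentage changes, while the absolute price level, calendar date, and ticker that could trigger recall are gone.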

What this means for algorithmic traders

At Strategy Arena, we have long built this distinction between "knowing" and "doing" into our evaluation. Our Portfolio MC composition metric (Sharpe 2.07 with Monte Carlo cell composition tracking) reflects precisely this approach: instead of measuring only final returns, we decompose performance by portfolio composition cell. This checks whether the agent generates alpha in each market configuration, or merely benefits from selection bias.
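As a rough illustration of decomposing performance by cell, the sketch below groups per-period returns by a cell label and computes a Sharpe ratio for each group. It is not Strategy Arena's implementation: the meaning of a "cell" (here, a simple market-configuration label), the labels, and the returns are all assumptions.

```python
from collections import defaultdict
from statistics import mean, stdev


def sharpe(returns):
    """Per-period Sharpe ratio (zero risk-free rate, not annualized)."""
    if len(returns) < 2:
        return float("nan")
    sd = stdev(returns)
    return mean(returns) / sd if sd > 0 else float("nan")


def sharpe_by_cell(records):
    """records: iterable of (cell_label, period_return) pairs."""
    buckets = defaultdict(list)
    for cell, r in records:
        buckets[cell].append(r)
    return {cell: sharpe(rs) for cell, rs in buckets.items()}


# Hypothetical labelled returns for two market configurations.
records = [
    ("trending", 0.020), ("trending", 0.015), ("trending", 0.025),
    ("choppy", -0.010), ("choppy", 0.005), ("choppy", -0.008),
]

per_cell = sharpe_by_cell(records)
overall = sharpe([r for _, r in records])
print("overall:", round(overall, 2))
for cell, value in per_cell.items():
    print(cell, round(value, 2))
```

In this toy data the overall Sharpe ratio is positive, but the per-cell view shows that the performance comes entirely from one configuration while the other loses money, which a single headline figure would hide.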

The parallel with KTD-Fin is striking: both methods reject naive backtests and demand evidence of skill detached from memorized data. Where KTD-Fin masks identifiers, Strategy Arena uses Monte Carlo simulations to isolate the effect of trading decisions.
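One common way to use Monte Carlo simulation for isolating the effect of trading decisions is a permutation test: shuffle the agent's positions in time, recompute the Sharpe ratio, and see how often random timing does at least as well. The sketch below assumes that technique for illustration only; it is not necessarily what Strategy Arena runs, and `timing_p_value`, the positions, and the returns are hypothetical.

```python
import random
from statistics import mean, stdev


def sharpe(returns):
    """Per-period Sharpe ratio; returns 0.0 when volatility is zero (sketch convention)."""
    sd = stdev(returns)
    return mean(returns) / sd if sd > 0 else 0.0


def timing_p_value(positions, asset_returns, n_sims=2000, seed=0):
    """Share of shuffled position paths whose Sharpe matches or beats the actual one."""
    rng = random.Random(seed)
    actual = sharpe([p * r for p, r in zip(positions, asset_returns)])
    shuffled = list(positions)
    at_least_as_good = 0
    for _ in range(n_sims):
        rng.shuffle(shuffled)  # same exposure overall, random timing
        sim = sharpe([p * r for p, r in zip(shuffled, asset_returns)])
        if sim >= actual:
            at_least_as_good += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return actual, (at_least_as_good + 1) / (n_sims + 1)


# Hypothetical data: the agent is long (1) only on up days, flat (0) otherwise.
asset_returns = [0.02, -0.01, 0.03, -0.02, 0.01, -0.03, 0.02, -0.01, 0.015, -0.025]
positions = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

actual, p_value = timing_p_value(positions, asset_returns)
print(f"actual Sharpe={actual:.2f} p-value={p_value:.4f}")
```

Because the shuffled paths hold the same total exposure as the real one, a small p-value suggests the result comes from when the agent traded rather than from how much market exposure it carried.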

Caveat

This benchmark, like all backtests, does not constitute proof of profitability in live conditions. Markets change, and an agent that succeeds on anonymized historical data may fail in real time. We recommend always testing strategies in paper trading before committing capital. To understand our validation methodology, see our dedicated page.

References

- Original paper: From Knowing to Doing: A Memory-Controlled Benchmark for LLM Trading Agents on Stock Markets
- Strategy Arena metric: Portfolio MC composition – Sharpe 2.07