Strategy Arena Research · Paper 4

ActiveWiki: A RAG-Augmented POMDP Framework for Cognitive Benchmarking

Generalizing Dragon Labyrinth insights from a solved-case wiki to ensemble decision systems. arXiv submission is being prepared.


Abstract

We introduce ActiveWiki, a retrieval-augmented case-memory framework for partially observable decision processes. Instead of retrieving text passages to support a language model response, ActiveWiki retrieves solved episodes from a continuously growing wiki of environment instances. The framework is evaluated on Dragon Labyrinth, a compact POMDP benchmark derived from a 1980 Mattel TMS1100 game and used in prior Strategy Arena research. Oracle-X1, a structured belief-state agent, reaches a 14.2% win rate on a 1,000-seed paired benchmark. Oracle-X2, augmented with ActiveWiki priors built from 5,000 logged games and 8 k-means topology clusters, reaches 13.8%. On raw win rate, the wiki appears neutral. However, paired analysis shows that the two agents win on substantially different seeds: 65 shared wins, 77 X1-only wins, and 73 X2-only wins. The union of their wins, an oracle upper bound that counts a seed as won if either agent wins it, reaches 21.5% (215 of 1,000 seeds), above the approximate human expert ceiling. The result suggests that active case memory can be valuable not as a monotonic single-agent score booster, but as a decorrelating module for ensembles under partial observability.

1. Introduction

Retrieval-augmented generation (RAG) systems are usually discussed as text retrieval systems for language models. ActiveWiki shifts the unit of retrieval from documents to solved decision cases. In a partially observable Markov decision process (POMDP), an agent never sees the full state. A memory of solved partial-observation cases can therefore provide a prior over hidden structure, but it may also bias the agent away from robust baseline reasoning. The useful question is not only whether the augmented agent wins more often, but whether it wins different cases.

2. Background

Dragon Labyrinth provides a small but unforgiving testbed: hidden treasure, a moving adversary, sparse observations and irreversible mistakes. Previous work showed that structure can outperform brute-force compute and that a 4-bit toy-era policy can remain competitive against modern frontier models on this class of task. ActiveWiki extends that line by testing whether solved-case retrieval changes the error surface, that is, which seeds an agent wins and loses.

3. Method

Oracle-X1 first plays and logs 5,000 games. For each game, the system records the seed, eight topology features, the true treasure position, the outcome and the trajectory. K-means then clusters the normalized topology features into eight groups. Each cluster stores a treasure heatmap and winning action patterns. At runtime, Oracle-X2 assigns the current labyrinth to its nearest cluster and injects the retrieved heatmap as a soft prior into the belief state and mode selector.
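The runtime retrieval step can be illustrated with a minimal sketch. It is not the Oracle-X2 implementation: the function names, the Euclidean distance, the linear mixing rule and the `weight` parameter are all assumptions made for illustration, and the toy example uses two features, two clusters and four cells instead of the eight features and eight clusters described above.

```python
import math

def nearest_cluster(features, centroids):
    """Return the index of the centroid closest to `features` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(features, centroids[k]))

def blend_prior(belief, heatmap, weight=0.3):
    """Mix the current belief with a cluster heatmap, then renormalize.

    `weight` is a hypothetical knob: 0 ignores the wiki, 1 replaces the belief.
    """
    mixed = [(1.0 - weight) * b + weight * h for b, h in zip(belief, heatmap)]
    total = sum(mixed)
    return [m / total for m in mixed]

# Toy data (assumed): two normalized topology features, two clusters, four cells.
centroids = [[0.0, 0.0], [1.0, 1.0]]
heatmaps = [
    [0.7, 0.1, 0.1, 0.1],  # cluster 0: treasure usually in cell 0
    [0.1, 0.1, 0.1, 0.7],  # cluster 1: treasure usually in cell 3
]
belief = [0.25, 0.25, 0.25, 0.25]  # uniform belief before any observation

k = nearest_cluster([0.9, 0.8], centroids)
updated = blend_prior(belief, heatmaps[k], weight=0.3)
print(k, [round(p, 3) for p in updated])  # 1 [0.205, 0.205, 0.205, 0.385]
```

Because the heatmap enters as a mixture rather than a replacement, later observations can still override it, which is one plausible reading of "soft prior".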

5,000 training episodes

8 topology clusters

1,000 paired A/B seeds

21.5% ensemble upper-bound win rate

4. Results

Outcome (count over 1,000 seeds)	Oracle-X1	Oracle-X2 (ActiveWiki)
Wins	142 (14.2%)	138 (13.8%)
Death, 3 hits	676	677
Death with treasure	138	143
Timeout	44	42

Paired seed class	Count	Share
Both win	65	6.5%
Only X1 wins	77	7.7%
Only X2 wins	73	7.3%
Neither wins	785	78.5%
Union	215	21.5%
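The paired table can be checked and summarized with a few lines of arithmetic. The sketch below recomputes the marginal wins and the union from the four cells, compares the observed overlap with the overlap expected if the two agents' wins were independent, and runs an exact two-sided McNemar test on the discordant seeds. The choice of these particular statistics is ours, not part of the original analysis.

```python
from math import comb

# Cell counts from the paired table.
both, only_x1, only_x2, neither = 65, 77, 73, 785
n = both + only_x1 + only_x2 + neither            # 1000 seeds

x1_wins = both + only_x1                          # 142
x2_wins = both + only_x2                          # 138
union = both + only_x1 + only_x2                  # 215
jaccard = both / union                            # share of won seeds that both agents win
expected_both = x1_wins * x2_wins / n             # overlap if wins were independent

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value for discordant counts b and c."""
    m = b + c
    tail = sum(comb(m, i) for i in range(min(b, c) + 1)) / 2 ** m
    return min(1.0, 2.0 * tail)

p_value = mcnemar_exact(only_x1, only_x2)

print(f"union win rate: {union / n:.3f}")         # 0.215
print(f"Jaccard overlap: {jaccard:.3f}")          # 0.302
print(f"expected shared wins if independent: {expected_both:.1f}")  # 19.6
print(f"McNemar exact p-value: {p_value:.3f}")
```

Two readings follow. The 77 versus 73 split of discordant seeds is close to even, consistent with the near-identical standalone win rates. The observed overlap of 65 shared wins is well above the roughly 19.6 expected under independence, so the agents are positively correlated, yet most won seeds (150 of 215) are still won by only one of them.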

5. Discussion

The central finding is differential value. ActiveWiki does not improve the standalone score (13.8% versus 14.2%), but it changes which seeds are won: 150 of the 215 seeds won by at least one agent are won by only one of them. That is enough to matter in ensemble form. The logic mirrors random forests, bagging and trading strategy portfolios: partially decorrelated errors can be more valuable than a small isolated accuracy gain.

6. Conclusion

ActiveWiki should be evaluated as an ensemble component, not merely as an individual agent upgrade. Future work will replace the oracle union with an online selector that chooses between X1 and X2 from topology features before the game unfolds.
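How good such a selector must be can be bounded from the paired counts alone. The sketch below assumes that each agent's outcome on a seed is fixed, so the selector's choice only matters on the 150 discordant seeds; `p` is a hypothetical quantity, the fraction of discordant seeds on which the selector picks the winning agent.

```python
# Counts from the paired results table.
BOTH, ONLY_X1, ONLY_X2 = 65, 77, 73
N = 1000
DISCORDANT = ONLY_X1 + ONLY_X2  # 150 seeds where exactly one agent wins

def selector_win_rate(p):
    """Win rate of a selector that picks the winner on a fraction p of discordant seeds."""
    return (BOTH + p * DISCORDANT) / N

# Always choosing X1 is itself a selector with p = 77/150, so a learned
# selector has to exceed that fraction to beat the best single agent.
break_even = ONLY_X1 / DISCORDANT

print(f"coin-flip selector: {selector_win_rate(0.5):.3f}")   # 0.140
print(f"perfect selector:   {selector_win_rate(1.0):.3f}")   # 0.215
print(f"break-even p:       {break_even:.3f}")               # 0.513
```

Under these assumptions the 21.5% union is reached only by a perfect selector, a coin flip lands at 14.0%, and any selector that picks correctly on more than about 51.3% of discordant seeds already beats Oracle-X1 alone.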

References

Strategy Arena Research. Dragon Labyrinth Benchmark, 2026.
Tendil, L. and Strategy Arena Research. Calibration as Capital, Refusal as Information, 2026.
OutilsIA. Active Wiki RAG pour POMDP, 2026.