Strategy Arena Research · Paper 4

ActiveWiki: A RAG-Augmented POMDP Framework for Cognitive Benchmarking

Generalizing Dragon Labyrinth insights from a solved-case wiki to ensemble decision systems. arXiv submission is being prepared.

Abstract

We introduce ActiveWiki, a retrieval-augmented case-memory framework for partially observable decision processes. Instead of retrieving text passages to support a language model response, ActiveWiki retrieves solved episodes from a continuously growing wiki of environment instances. The framework is evaluated on Dragon Labyrinth, a compact POMDP benchmark derived from a 1980 Mattel TMS1100 game and used in prior Strategy Arena research. Oracle-X1, a structured belief-state agent, reaches a 14.2% win rate on a 1,000-seed paired benchmark. Oracle-X2, augmented with ActiveWiki priors built from 5,000 logged games and 8 k-means topology clusters, reaches 13.8%. On raw win rate, the wiki appears neutral. However, paired analysis shows that the two agents win on substantially different seeds: 65 shared wins, 77 X1-only wins, and 73 X2-only wins. The union of their wins, an oracle upper bound for any ensemble that picks one of the two agents per seed, covers 215 seeds (21.5%), above the approximate human expert ceiling. The result suggests that active case memory can be valuable not as a monotonic single-agent score booster, but as a decorrelating module for ensembles under partial observability.

1. Introduction

Retrieval-augmented generation (RAG) is usually discussed as text retrieval for language models. ActiveWiki shifts the unit of retrieval from documents to solved decision cases. In a partially observable Markov decision process (POMDP), an agent never sees the full state. A memory of solved partial-observation cases can therefore provide a prior over hidden structure, but it may also bias the agent away from robust baseline reasoning. The useful question is not only whether the augmented agent wins more often, but whether it wins different cases.

2. Background

Dragon Labyrinth provides a small but unforgiving testbed: hidden treasure, a moving adversary, sparse observations and irreversible mistakes. Previous work showed that structure can outperform brute-force compute and that a 4-bit toy-era policy can remain competitive against modern frontier models on this class of task. ActiveWiki extends that line by testing whether solved-case retrieval changes the error surface, that is, which seeds an agent wins and loses.

3. Method

The wiki is built from 5,000 games played and logged by Oracle-X1. For each game, the system records the seed, eight topology features, the true treasure position, the outcome and the trajectory. K-means then groups the normalized topology features into eight clusters. Each cluster stores a treasure heatmap and its winning action patterns. At runtime, Oracle-X2 assigns the current labyrinth to its nearest cluster and injects the retrieved heatmap as a soft prior into the belief state and mode selector.
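
The runtime retrieval step can be illustrated with a minimal sketch. It assumes the cluster centroids and per-cluster treasure heatmaps have already been fitted offline, and it uses log-linear pooling with a hypothetical `weight` parameter as one possible form of "soft prior"; the source does not specify the blending rule, and the mode selector is not modeled here. All function names and the toy numbers (two features, two clusters, four cells, where the paper uses eight features and eight clusters) are illustrative assumptions.

```python
import math

def normalize(features, mean, std):
    """Z-score each topology feature; a zero std is treated as 1."""
    return [(f - m) / (s if s else 1.0) for f, m, s in zip(features, mean, std)]

def nearest_cluster(features, centroids):
    """Index of the centroid closest to `features` in Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda k: math.dist(features, centroids[k]))

def inject_prior(belief, heatmap, weight=0.3, eps=1e-6):
    """Blend a retrieved heatmap into the belief as a soft prior.

    Assumed rule (not from the source): log-linear pooling,
    posterior ∝ belief * (heatmap + eps) ** weight.
    Cells already ruled out (belief == 0) stay at zero.
    """
    scores = [b * (h + eps) ** weight for b, h in zip(belief, heatmap)]
    total = sum(scores)
    if total == 0:
        return list(belief)  # nothing to renormalize; keep the belief as-is
    return [s / total for s in scores]

# Toy data, purely illustrative.
centroids = [[-1.0, -1.0], [1.0, 1.0]]
heatmaps = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]

x = normalize([9.0, 4.0], mean=[6.0, 2.0], std=[3.0, 2.0])  # -> [1.0, 1.0]
k = nearest_cluster(x, centroids)                           # -> 1

belief = [0.0, 1 / 3, 1 / 3, 1 / 3]   # cell 0 already ruled out by observation
posterior = inject_prior(belief, heatmaps[k])
```

With `weight=0` the belief is returned unchanged, so the parameter controls how far the retrieved case memory is allowed to pull the agent away from its baseline reasoning.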

5,000 training episodes
8 topology clusters
1,000 paired A/B seeds
21.5% ensemble upper-bound win rate

4. Results

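| Outcome | Oracle-X1 | Oracle-X2 (ActiveWiki) |
|---|---|---|
| Wins | 142 (14.2%) | 138 (13.8%) |
| Death, 3 hits | 676 | 677 |
| Death with treasure | 138 | 143 |
| Timeout | 44 | 42 |

| Paired seed class | Count | Share |
|---|---|---|
| Both win | 65 | 6.5% |
| Only X1 wins | 77 | 7.7% |
| Only X2 wins | 73 | 7.3% |
| Neither wins | 785 | 78.5% |
| Union | 215 | 21.5% |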
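
The paired counts can be checked and summarized with a short script. The sketch below uses only the four paired seed classes reported above. The overlap statistics (Jaccard index, phi coefficient, overlap expected under independence) and the exact McNemar test on the discordant seeds are standard tools chosen here for illustration; the source does not state which paired analysis the authors ran.

```python
import math

# Paired seed classes from the results (1,000 seeds).
both, only_x1, only_x2, neither = 65, 77, 73, 785
n = both + only_x1 + only_x2 + neither

x1_wins = both + only_x1            # marginal wins for Oracle-X1
x2_wins = both + only_x2            # marginal wins for Oracle-X2
union = both + only_x1 + only_x2    # seeds won by at least one agent

# Overlap expected if the two agents' wins were statistically independent.
expected_overlap = x1_wins * x2_wins / n

# Jaccard index of the two win sets: shared wins over union.
jaccard = both / union

# Phi coefficient of the 2x2 table (Pearson correlation of the win indicators).
phi = (both * neither - only_x1 * only_x2) / math.sqrt(
    x1_wins * (n - x1_wins) * x2_wins * (n - x2_wins)
)

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts b and c."""
    m, k = b + c, min(b, c)
    tail = sum(math.comb(m, i) for i in range(k + 1)) / 2 ** m
    return min(1.0, 2 * tail)

p_value = mcnemar_exact(only_x1, only_x2)

print(f"X1 wins {x1_wins}, X2 wins {x2_wins}, union {union} ({union / n:.1%})")
print(f"expected overlap if independent: {expected_overlap:.1f}, observed: {both}")
print(f"Jaccard {jaccard:.3f}, phi {phi:.3f}, McNemar exact p {p_value:.3f}")
```

The marginals reproduce the reported 142 and 138 wins, and the union reproduces 215. The observed overlap of 65 seeds is well above the roughly 19.6 expected under independence, so the agents' wins are positively correlated, yet 150 of the 215 union seeds are still won by only one agent. The discordant split of 77 against 73 is close to even, which is consistent with the wiki looking neutral on raw win rate.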

5. Discussion

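The central finding is differential value. ActiveWiki does not improve the standalone score (13.8% against 14.2%), but it changes which seeds are won: 150 of the 215 seeds in the union are won by only one of the two agents. This mirrors random forests, bagging and trading strategy portfolios, where weakly correlated errors can be more valuable than a small isolated accuracy gain. The 21.5% figure is an oracle upper bound; a deployed ensemble needs a selector that picks the right agent without seeing the outcome.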

6. Conclusion

ActiveWiki should be evaluated as an ensemble component, not merely as an individual agent upgrade. Future work will replace the oracle union with an online selector that chooses between X1 and X2 from topology features before the game unfolds.
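
One simple form such a selector could take is a per-cluster lookup table, sketched below. This is a hypothetical baseline, not the authors' planned design: it assumes each paired seed has already been assigned a topology cluster, and it picks, per cluster, whichever agent won more often. The function names and the `records` list are made up for illustration and are not results from the benchmark.

```python
from collections import defaultdict

def fit_selector(records, default="X1"):
    """Map each cluster to the agent with more wins on the given paired seeds.

    `records` holds (cluster_id, x1_won, x2_won) tuples.
    Ties fall back to `default`.
    """
    wins = defaultdict(lambda: {"X1": 0, "X2": 0})
    for cluster_id, x1_won, x2_won in records:
        wins[cluster_id]["X1"] += int(x1_won)
        wins[cluster_id]["X2"] += int(x2_won)
    table = {}
    for cluster_id, w in wins.items():
        table[cluster_id] = default if w["X1"] == w["X2"] else max(w, key=w.get)
    return table

def select_agent(table, cluster_id, default="X1"):
    """Choose an agent before the game starts; unseen clusters use `default`."""
    return table.get(cluster_id, default)

def ensemble_win_rate(table, records, default="X1"):
    """Win rate obtained by playing only the selected agent on each seed."""
    hits = 0
    for cluster_id, x1_won, x2_won in records:
        choice = select_agent(table, cluster_id, default)
        hits += int(x1_won if choice == "X1" else x2_won)
    return hits / len(records) if records else 0.0

# Synthetic paired outcomes, for illustration only.
records = [
    (0, True, False), (0, True, False), (0, False, True),
    (1, False, True), (1, False, True), (1, True, False),
    (2, True, True),
]
table = fit_selector(records)
rate = ensemble_win_rate(table, records)
```

In this toy data each agent alone wins 4 of 7 seeds, the selector wins 5 of 7, and the oracle union wins all 7, which shows the gap a realizable selector has to close. Scoring the selector on the same seeds used to fit it is optimistic; a fair estimate would use held-out seeds.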

References