ActiveWiki: A RAG-Augmented POMDP Framework for Cognitive Benchmarking
Generalizing Dragon Labyrinth insights from a solved-case wiki to ensemble decision systems. arXiv submission is being prepared.
Abstract
We introduce ActiveWiki, a retrieval-augmented case-memory framework for partially observable decision processes. Instead of retrieving text passages to support a language model response, ActiveWiki retrieves solved episodes from a continuously growing wiki of environment instances. The framework is evaluated on Dragon Labyrinth, a compact POMDP benchmark derived from a 1980 Mattel TMS1100 game and used in prior Strategy Arena research. Oracle-X1, a structured belief-state agent, reaches a 14.2% win rate on a 1,000-seed paired benchmark. Oracle-X2, augmented with ActiveWiki priors built from 5,000 logged games and eight k-means topology clusters, reaches 13.8%. On raw win rate, the wiki appears neutral. However, paired analysis shows that the two agents win on substantially different seeds: 65 shared wins, 77 X1-only wins, and 73 X2-only wins. The union of seeds won by at least one agent reaches 21.5%, above the approximate human expert ceiling. The result suggests that active case memory can be valuable not as a monotonic single-agent score booster, but as a decorrelating module for ensembles under partial observability.
1. Introduction
RAG systems are usually discussed as text retrieval systems for language models. ActiveWiki shifts the unit of retrieval from documents to solved decision cases. In a POMDP, an agent never sees the full state. A memory of solved partial-observation cases can therefore provide a prior over hidden structure, but it may also bias the agent away from robust baseline reasoning. The useful question is not only whether the augmented agent wins more often, but whether it wins different cases.
2. Background
Dragon Labyrinth provides a small but unforgiving testbed: hidden treasure, a moving adversary, sparse observations and irreversible mistakes. Previous work showed that structure can outperform brute compute and that a 4-bit toy-era policy can remain competitive against modern frontier models on this class of task. ActiveWiki extends that line by testing whether solved-case retrieval changes the error surface.
3. Method
Oracle-X1 first plays and logs 5,000 games. For each game, the system records the seed, eight topology features, the true treasure position, the outcome and the trajectory. K-means then groups the normalized topology features into eight clusters. Each cluster stores a treasure heatmap and the winning action patterns observed in its games. At runtime, Oracle-X2 assigns the current labyrinth to its nearest cluster and injects the retrieved heatmap as a soft prior into the belief state and mode selector.
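The runtime retrieval step can be illustrated with a short sketch. It is not the authors' implementation: the function names `nearest_cluster` and `inject_prior`, the blending exponent `weight`, and the multiplicative update rule are assumptions made for illustration, and the toy data uses two features and four candidate cells rather than the eight topology features described above.

```python
import math

def nearest_cluster(features, centroids):
    """Return the index of the centroid closest (Euclidean) to `features`."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(features, centroids[i]))

def inject_prior(belief, heatmap, weight=0.5, eps=1e-6):
    """Blend a retrieved heatmap into the belief as a soft prior.

    Assumed rule: new_belief[c] is proportional to
    belief[c] * (heatmap[c] + eps) ** weight.
    weight=0 leaves a normalized belief unchanged; eps keeps cells that
    never held treasure in the logs from being ruled out entirely.
    """
    scores = [b * (h + eps) ** weight for b, h in zip(belief, heatmap)]
    total = sum(scores)
    return [s / total for s in scores]

# Toy wiki (hypothetical values): two clusters, each with a centroid in
# feature space and a treasure heatmap over four candidate cells.
centroids = [[0.0, 0.0], [1.0, 1.0]]
heatmaps = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]

features = [0.9, 0.8]                # normalized topology features
belief = [0.25, 0.25, 0.25, 0.25]    # uniform belief over treasure cells

cluster = nearest_cluster(features, centroids)
posterior = inject_prior(belief, heatmaps[cluster])
print(cluster, [round(p, 3) for p in posterior])
```

Keeping `weight` below 1 is one way to make the prior soft: the retrieved heatmap shifts probability mass toward historically likely cells without overriding what the agent has observed.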
4. Results
| Outcome | Oracle-X1 | Oracle-X2 ActiveWiki |
|---|---|---|
| Wins | 142 (14.2%) | 138 (13.8%) |
| Death, 3 hits | 676 | 677 |
| Death with treasure | 138 | 143 |
| Timeout | 44 | 42 |

| Paired seed class | Count | Share |
|---|---|---|
| Both win | 65 | 6.5% |
| Only X1 wins | 77 | 7.7% |
| Only X2 wins | 73 | 7.3% |
| Neither wins | 785 | 78.5% |
| Union (X1 or X2 wins) | 215 | 21.5% |
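The paired figures can be re-derived from the four seed classes alone. The sketch below recomputes the win rates and the union, then adds three statistics that are not reported in the tables and are offered only as one possible way to read them: the Jaccard overlap of the two win sets, the shared-win count expected if the agents' wins were statistically independent, and an exact two-sided McNemar test on the discordant seeds. The helper name `mcnemar_exact` is an assumption.

```python
from math import comb

# Paired counts from the table above (1,000 seeds).
both, only_x1, only_x2, neither = 65, 77, 73, 785
n = both + only_x1 + only_x2 + neither

x1_wins = both + only_x1
x2_wins = both + only_x2
union = both + only_x1 + only_x2

# Jaccard overlap of the two win sets: shared wins / seeds won by either.
jaccard = both / union

# Shared wins expected if the two agents' wins were independent.
expected_both = x1_wins * x2_wins / n

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts b and c."""
    m = b + c
    tail = sum(comb(m, k) for k in range(min(b, c) + 1)) / 2 ** m
    return min(1.0, 2 * tail)

p_value = mcnemar_exact(only_x1, only_x2)

print(f"X1 {x1_wins / n:.1%}, X2 {x2_wins / n:.1%}, union {union / n:.1%}")
print(f"Jaccard overlap {jaccard:.3f}, expected shared wins {expected_both:.1f}")
print(f"McNemar exact p = {p_value:.3f}")
```

McNemar's test suits a paired design because it uses only the seeds on which the two agents disagree (77 versus 73), which is the comparison behind the statement that the wiki is neutral on raw win rate. Comparing the observed shared wins with the independence baseline gives a rough gauge of how related the two win sets are.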
5. Discussion
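The central finding is differential value: the wiki changes which seeds are won rather than how many. ActiveWiki does not improve the standalone score (13.8% versus 14.2%), but it decorrelates failure modes enough to matter in ensemble form: of the 215 seeds won by at least one agent, 150 are won by only one of them. The 21.5% union is an oracle figure, reachable only by a selector that always picks the winning agent. This mirrors random forests, bagging and trading strategy portfolios: errors that are only partly correlated can be more valuable than a small isolated accuracy gain.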
6. Conclusion
ActiveWiki should be evaluated as an ensemble component, not merely as an individual agent upgrade. Future work will replace the oracle union with an online selector that chooses between X1 and X2 from topology features before the game unfolds.
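One simple form such a selector could take is a per-cluster lookup fitted on paired logs. The sketch below is purely hypothetical and is not a result or design from this work: the function name `fit_cluster_selector`, the record layout `(cluster_id, x1_won, x2_won)`, the tie-breaking rule and the toy records are all assumptions.

```python
from collections import defaultdict

def fit_cluster_selector(paired_logs, default="X1"):
    """Build a selector from (cluster_id, x1_won, x2_won) records.

    For each cluster, choose the agent with more logged wins. Ties and
    clusters absent from the logs fall back to `default`.
    """
    tally = defaultdict(lambda: [0, 0])
    for cluster_id, x1_won, x2_won in paired_logs:
        tally[cluster_id][0] += int(x1_won)
        tally[cluster_id][1] += int(x2_won)

    choice = {}
    for cluster_id, (w1, w2) in tally.items():
        if w1 > w2:
            choice[cluster_id] = "X1"
        elif w2 > w1:
            choice[cluster_id] = "X2"
        else:
            choice[cluster_id] = default

    def select(cluster_id):
        return choice.get(cluster_id, default)

    return select

# Toy paired records (made up for illustration, not benchmark data).
toy_logs = [
    (0, True, False), (0, True, False), (0, False, True),
    (1, False, True), (1, False, True),
    (2, True, True),
]
select = fit_cluster_selector(toy_logs)
print([select(c) for c in (0, 1, 2, 7)])
```

Because the choice depends only on the cluster assignment, it can be made before the game unfolds. A selector of this kind would need to be fitted on seeds disjoint from the evaluation set, otherwise its win rate would be biased toward the oracle union.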
References
- Strategy Arena Research. Dragon Labyrinth Benchmark, 2026.
- Tendil, L. and Strategy Arena Research. Calibration as Capital, Refusal as Information, 2026.
- OutilsIA. Active Wiki RAG pour POMDP, 2026.