July 5, 2026 | Research, Benchmark, Agents

To Test Agent Memory, They Made It Play Slay the Spire

AgenticSTS is the cleanest agent-memory experiment I've seen in a while, and the testbed is a deck-building video game. The team put LLM agents in front of Slay the Spire 2, a stochastic card game that needs hundreds of tactical and strategic decisions per run, and asked a simple question: does giving an agent an explicit memory layer actually make it play better, or does it just feel like it should?

The trick is the bounded contract. Instead of stuffing every past turn back into the prompt until it becomes an unreadable pile, each decision gets a freshly assembled message built by typed retrieval: pull exactly the relevant memory, nothing else. The prompt stays bounded no matter how long the game runs. That constraint is what lets them cleanly ablate one memory component at a time, which is exactly what nobody can do when everyone just dumps the whole history in.
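To make the idea concrete, here is a minimal sketch of a bounded, typed-retrieval prompt builder. It is not the AgenticSTS implementation: the memory types, the tag-overlap ranking, the character budget, and every name in it (`MemoryType`, `MemoryEntry`, `build_prompt`) are assumptions made up for illustration, and the stored tips are invented.

```python
from dataclasses import dataclass
from enum import Enum


class MemoryType(Enum):
    # Hypothetical memory components; the paper's actual types may differ.
    STRATEGIC_SKILL = "strategic_skill"
    COMBAT_TACTIC = "combat_tactic"
    CARD_NOTE = "card_note"


@dataclass(frozen=True)
class MemoryEntry:
    kind: MemoryType
    tags: frozenset
    text: str


def build_prompt(state_summary, state_tags, request, store, max_chars=600):
    """Assemble a fresh prompt for one decision.

    `request` maps each memory type to the maximum number of entries to pull.
    Leaving a type out of `request` ablates that component entirely.
    """
    lines = [state_summary]
    used = len(state_summary)
    for kind, limit in request.items():
        # Typed retrieval: only entries of this type that share a tag with the state.
        matches = [m for m in store if m.kind is kind and m.tags & state_tags]
        matches.sort(key=lambda m: len(m.tags & state_tags), reverse=True)
        for m in matches[:limit]:
            line = f"[{kind.value}] {m.text}"
            if used + len(line) + 1 > max_chars:  # +1 for the joining newline
                break
            lines.append(line)
            used += len(line) + 1
    return "\n".join(lines)


# Invented example memories, not taken from the released trajectories.
store = [
    MemoryEntry(MemoryType.STRATEGIC_SKILL, frozenset({"elite", "act1"}),
                "Skip elites when below half HP."),
    MemoryEntry(MemoryType.COMBAT_TACTIC, frozenset({"elite"}),
                "Block first on the elite's multi-hit turn."),
    MemoryEntry(MemoryType.CARD_NOTE, frozenset({"shop"}),
                "Card removal is usually worth the gold."),
]
summary = "Act 1, elite fight, HP 30/80."
tags = frozenset({"elite", "act1"})
request = {MemoryType.STRATEGIC_SKILL: 1, MemoryType.COMBAT_TACTIC: 1}

full = build_prompt(summary, tags, request, store)
ablated = build_prompt(summary, tags, {MemoryType.COMBAT_TACTIC: 1}, store)
print(full)
print("---")
print(ablated)
```

Because the prompt size depends only on the request and the budget, not on how much has been stored, adding a thousand more entries to `store` leaves the output the same size. Ablating a component is a one-line change to `request`, which is what makes the comparison clean.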

The result, stated honestly: with a strategic skill layer on, the agent won 6 of 10 games; the no-memory baseline won 3 of 10. They flag it themselves as directional, not statistically decisive at that sample size, which is the right way to report it and rarer than it should be. For context, a public benchmark reports zero wins at the lowest difficulty, and the human win rate is 16 percent.
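For a rough sense of why 6 of 10 versus 3 of 10 is only directional, here is a quick check using Fisher's exact test. The choice of test is mine, not necessarily the one the authors used, and the function name is made up for this sketch.

```python
from math import comb


def fisher_exact_two_sided(wins_a, n_a, wins_b, n_b):
    """Two-sided Fisher's exact test for two win/loss records."""
    total_wins = wins_a + wins_b
    denom = comb(n_a + n_b, total_wins)

    def ways(k):
        # Number of ways group A gets k of the total wins (hypergeometric count).
        return comb(n_a, k) * comb(n_b, total_wins - k)

    lo = max(0, total_wins - n_b)
    hi = min(n_a, total_wins)
    observed = ways(wins_a)
    # Sum every outcome that is no more likely than the one observed.
    extreme = sum(ways(k) for k in range(lo, hi + 1) if ways(k) <= observed)
    return extreme / denom


p = fisher_exact_two_sided(6, 10, 3, 10)
print(f"two-sided p = {p:.3f}")
```

The p-value comes out at about 0.37, nowhere near the usual 0.05 threshold, so a gap this size at ten games per arm could easily be luck. That matches the authors' own framing.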

Why a game matters: Slay the Spire is long-horizon, stochastic, and unforgiving. It punishes an agent that forgets the plan it made three turns ago, which is exactly how real agents fail on real long tasks. They released 298 tagged trajectories, frozen snapshots, and analysis scripts, so you can actually reproduce it. Agent memory has been all product launches and vibes for two months. This is someone building a ruler.

Link: arxiv.org/abs/2607.02255
