To Test Agent Memory, They Made It Play Slay the Spire
AgenticSTS is the cleanest agent-memory experiment I've seen in a while, and the testbed is a deck-building video game. The team put LLM agents in front of Slay the Spire 2, a stochastic card game that needs hundreds of tactical and strategic decisions per run, and asked a simple question: does giving an agent an explicit memory layer actually make it play better, or does it just feel like it should?
The trick is the bounded contract. Instead of stuffing every past turn back into the prompt until it becomes an unreadable pile, each decision gets a freshly assembled message built by typed retrieval: pull exactly the relevant memory and nothing else. The prompt stays bounded no matter how long the game runs. That constraint is what lets them cleanly ablate one memory component at a time, which is exactly what nobody can do when the whole history just gets dumped in.
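To make the bounded contract concrete, here is a minimal sketch of what typed retrieval into a fixed-size prompt could look like. It is not the paper's implementation: `MemoryItem`, `assemble_prompt`, the memory kinds, and the tag-overlap ranking are all hypothetical names and choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    kind: str        # hypothetical memory type, e.g. "skill" or "episode"
    tags: frozenset  # hypothetical situation tags used for retrieval
    text: str        # one-line note that would be shown to the model


def assemble_prompt(state_text, state_tags, store, enabled_kinds, per_kind_limit=2):
    """Build a fresh message for one decision (illustrative only).

    The result has at most 1 + len(enabled_kinds) * per_kind_limit lines,
    regardless of how many items the store has accumulated.
    """
    lines = [state_text]
    for kind in sorted(enabled_kinds):
        # Typed retrieval: only items of this kind that share a tag with the state.
        matches = [m for m in store if m.kind == kind and m.tags & state_tags]
        # Rank by tag overlap (an assumed relevance score), best first.
        matches.sort(key=lambda m: len(m.tags & state_tags), reverse=True)
        lines.extend(f"[{kind}] {m.text}" for m in matches[:per_kind_limit])
    return "\n".join(lines)
```

In a setup like this, ablating one memory component is just a matter of dropping its kind from `enabled_kinds`; everything else about the prompt stays the same, which is what makes the comparison clean.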
The result, stated honestly: with a strategic skill layer on, the agent won 6 of 10 games; the no-memory baseline won 3 of 10. They flag it themselves as directional, not statistically decisive at that sample size, which is the right way to report it and rarer than it should be. For context, a public benchmark reports zero wins at the lowest difficulty and the human win rate is 16 percent.
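If you want to see why 6 of 10 versus 3 of 10 counts as directional rather than decisive, a quick back-of-the-envelope check helps. The sketch below runs a two-sided Fisher's exact test on those two counts using only the standard library. This is my own sanity check, not the analysis from the paper, and the function name is made up for illustration.

```python
from math import comb


def fisher_exact_two_sided(wins_a, losses_a, wins_b, losses_b):
    """Two-sided Fisher's exact test on a 2x2 win/loss table."""
    n_a = wins_a + losses_a
    n_b = wins_b + losses_b
    total_wins = wins_a + wins_b

    def ways(k):
        # Number of ways group A gets k of the total wins (hypergeometric numerator).
        return comb(n_a, k) * comb(n_b, total_wins - k)

    lowest = max(0, total_wins - n_b)
    highest = min(n_a, total_wins)
    observed = ways(wins_a)
    # Sum every outcome that is no more likely than the observed one.
    # Integer counts are compared to avoid floating-point tie issues.
    extreme = sum(ways(k) for k in range(lowest, highest + 1) if ways(k) <= observed)
    return extreme / comb(n_a + n_b, total_wins)


# Skill layer: 6 wins, 4 losses. No-memory baseline: 3 wins, 7 losses.
p_value = fisher_exact_two_sided(6, 4, 3, 7)
print(f"two-sided p = {p_value:.2f}")
```

On these counts the p-value comes out around 0.37, nowhere near a conventional 0.05 threshold, which squares with the authors' own caution about the sample size.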
Why a game matters: Slay the Spire is long-horizon, stochastic, and unforgiving. It punishes an agent that forgets the plan it made three turns ago, which is the exact failure real agents have on real long tasks. They released 298 tagged trajectories, frozen snapshots, and analysis scripts, so you can actually reproduce the results. Agent memory has been all product launches and vibes for two months. This is someone building a ruler.
Link: arxiv.org/abs/2607.02255