A new paper says language models need sleep, literally
This one hit the Hacker News front page this morning, and the title carries the whole pitch: Language Models Need Sleep. It comes from Tom Goldstein's group at Maryland and CMU's Giulia Fanti, and it's a genuinely different take on the context-length problem.
The idea is biological. A transformer drowns as its context grows: attention scales badly and the KV cache balloons. So instead of carrying everything in working memory forever, the model goes to sleep. It runs offline recurrent passes over what it has accumulated, burns that context into persistent fast weights inside state-space blocks, and then dumps the KV cache. When it wakes, the information is already in the weights, so inference stays fast and the context is gone but not lost. Consolidation happens during downtime, much like the standard theory of what your brain does overnight.
The finding that makes it real: the longer it sleeps, meaning the more offline passes it runs, the better it does, and the gains are biggest on the hard problems that need deep reasoning. On math tasks where plain transformers and even SSM-attention hybrids failed, the sleeping models got there. Sleep time literally trades for reasoning ability.
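To make the sleep loop concrete, here is a toy sketch in plain Python. It is not the paper's architecture: it swaps in a tiny linear associative memory trained with the classic delta rule as a stand-in for "fast weights", and every name in it (sleep, recall_error, PAIRS, run) is made up for illustration. It shows only the shape of the idea: replay the cached key/value pairs offline for some number of passes, write them into a weight matrix, drop the cache, then answer from the weights alone.

```python
# Toy illustration only: a linear fast-weight memory trained with the delta
# rule. This is an assumed stand-in, not the method from the paper.

def matvec(W, key):
    """Read from the fast weights: returns W @ key as a list."""
    return [sum(w * k for w, k in zip(row, key)) for row in W]


def sleep(W, kv_cache, passes, step=1.0):
    """Replay the cache `passes` times, writing it into W, then drop the cache.

    Keys are assumed to be unit-norm, so step=1.0 makes each update store the
    current pair exactly (at the cost of disturbing overlapping pairs).
    """
    for _ in range(passes):
        for key, value in kv_cache:
            pred = matvec(W, key)
            for row, v_i, p_i in zip(W, value, pred):
                err = v_i - p_i
                for j, k_j in enumerate(key):
                    row[j] += step * err * k_j  # delta-rule update
    kv_cache.clear()  # the context is gone; only the weights remain
    return W


def recall_error(W, pairs):
    """Largest absolute error when recalling each value from its key."""
    return max(
        abs(v_i - p_i)
        for key, value in pairs
        for v_i, p_i in zip(value, matvec(W, key))
    )


# Hypothetical "context": three overlapping unit-norm keys with 2-d values.
PAIRS = [
    ([1.0, 0.0, 0.0], [1.0, 0.0]),
    ([0.6, 0.8, 0.0], [-1.0, 1.0]),
    ([0.0, 0.6, 0.8], [2.0, -1.0]),
]


def run(passes):
    """Sleep for `passes` replays from blank weights; return the recall error."""
    W = [[0.0] * 3 for _ in range(2)]
    cache = [(list(k), list(v)) for k, v in PAIRS]
    sleep(W, cache, passes)
    assert not cache  # nothing left in "working memory"
    return recall_error(W, PAIRS)


for n in (1, 5, 20):
    print(f"passes={n:2d}  recall error={run(n):.6f}")
```

Because the keys overlap, one pass leaves the earlier pairs partly overwritten, and extra passes clean that up. That is ordinary iterative convergence in a linear memory, so treat it as a loose analogy to the paper's "more sleep, better results" finding, not as evidence for it.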
Why care if you build agents? This is the agent-memory problem from a new angle. Everyone is bolting on vector stores and memory frameworks to fake long-term recall. This paper suggests that maybe the model should metabolize its own history into weights instead: remember by changing, not by retrieving. It's early and it's research, but it's the most interesting framing of agent memory I've seen this month.
Paper: arxiv.org/abs/2605.26099