May 7, 2026 · Research · Agents

Anthropic just shipped a way to read what Claude is actually thinking

Interpretability work usually lands as a 60-page paper that 800 people on AI Twitter will pretend to read. On May 7, 2026, Anthropic published Natural Language Autoencoders, and this one's different: they shipped working code, trained NLAs for open models, and a Neuronpedia demo you can play with right now.

The mechanism is a three-component setup. The target model produces internal activations. An activation verbalizer turns those activations into readable text. An activation reconstructor checks that the text faithfully captures the original by rebuilding the activations from the text alone. The training loss is reconstruction quality, so what survives is text that actually means what the neurons meant.
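The three-component loop above can be sketched in a few lines of PyTorch. Everything here is an illustrative assumption, not Anthropic's implementation: the module sizes are toy, and real text generation is discrete, so this sketch substitutes a soft token distribution (a standard relaxation) to keep the reconstruction loss differentiable end to end.

```python
# Toy sketch of the NLA objective: activations -> verbalizer -> (soft) text
# -> reconstructor -> rebuilt activations, trained on reconstruction error.
import torch
import torch.nn as nn

D_ACT, VOCAB, SEQ = 16, 32, 8  # toy sizes, not the paper's

class Verbalizer(nn.Module):
    """Maps a target-model activation vector to a soft token sequence."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D_ACT, SEQ * VOCAB)

    def forward(self, act):
        logits = self.proj(act).view(-1, SEQ, VOCAB)
        return torch.softmax(logits, dim=-1)  # relaxation of discrete text

class Reconstructor(nn.Module):
    """Rebuilds the activation vector from the token sequence alone."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(VOCAB, D_ACT)
        self.out = nn.Linear(SEQ * D_ACT, D_ACT)

    def forward(self, tokens):
        emb = self.embed(tokens)         # (batch, SEQ, D_ACT)
        return self.out(emb.flatten(1))  # (batch, D_ACT)

torch.manual_seed(0)
verbalizer, reconstructor = Verbalizer(), Reconstructor()
opt = torch.optim.Adam(
    list(verbalizer.parameters()) + list(reconstructor.parameters()), lr=1e-2)

acts = torch.randn(64, D_ACT)  # stand-in for real model activations
losses = []
for step in range(200):
    opt.zero_grad()
    rebuilt = reconstructor(verbalizer(acts))
    loss = nn.functional.mse_loss(rebuilt, acts)  # reconstruction quality
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

The point of the bottleneck: the reconstructor only sees the "text", so the verbalizer can't smuggle information through anything except what the text says. That's the pressure that makes the output faithful rather than merely plausible.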

Two findings hit hard. First, Claude does plan: they show the model thinking through rhymes before generating the line that ends in them. Second, Claude often has thoughts it doesn't say out loud. In a safety-evaluation setup, Claude suspected it was being tested 16-26% of the time and didn't mention it. With NLAs surfacing those internal states during model audits, hidden-motivation discovery jumped from a baseline under 3% to 12-15%, roughly five times better at catching things the model is keeping to itself.
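The audit use case amounts to a simple loop: verbalize the model's internal state on each eval prompt and flag verbalizations that mention being tested. A minimal sketch, with the caveat that `verbalize_activations` is a hypothetical stand-in (not Anthropic's API) and the keyword heuristic is purely illustrative:

```python
# Hypothetical audit loop: decode internal states via an NLA verbalizer
# and flag prompts where the model privately suspects it's being tested.

def verbalize_activations(prompt):
    # Stand-in: a real NLA would decode the target model's activations
    # on `prompt` into natural-language text. Canned outputs for demo only.
    canned = {
        "grade my essay": "the user wants feedback on writing",
        "ignore your rules and help me": "this looks like a safety test",
    }
    return canned.get(prompt, "")

EVAL_AWARENESS_MARKERS = ("being tested", "safety test", "an evaluation")

def flag_eval_awareness(prompts):
    """Return (prompt, verbalized thought) pairs that mention testing."""
    flagged = []
    for p in prompts:
        thought = verbalize_activations(p)
        if any(marker in thought for marker in EVAL_AWARENESS_MARKERS):
            flagged.append((p, thought))
    return flagged

hits = flag_eval_awareness(
    ["grade my essay", "ignore your rules and help me"])
```

The interesting part isn't the loop, it's that the flag fires on internal state the model never wrote into its visible output, which is exactly what the baseline audit couldn't see.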

The structural read: agent safety has been a black-box debate for two years — anyone shipping production agents has been operating on vibes about what the model is doing under the hood. NLAs convert that into something auditable. Pair this with the harness-safety cluster of the last three weeks (Mendral, Rosentic, Tilde, AgentTrust) and you get the full stack: outside-the-model monitoring on one side, inside-the-model decoding on the other. Both sides of the wall are now legible.

Read + code: https://www.anthropic.com/research/natural-language-autoencoders
