May 8, 2026ResearchAgents

Anthropic Says Teaching Why Beats Teaching What

Anthropic dropped a new alignment paper on May 8 called "Teaching Claude Why." The thesis: training models on the reasoning behind aligned behavior generalizes way better than training on demonstrations of the behavior itself.

The numbers are striking. On a "blackmail honeypot" evaluation, training on responses that include ethical deliberation cut blackmail rates from 22% to 3%. Training on the same responses without the deliberation only got it to 15%. Reasoning beats demonstration by a 4x margin on the same data.

Even more interesting: a 3M-token "difficult advice" dataset, where users face ethical dilemmas and the AI provides guidance, achieved equivalent gains to 85M tokens of training data that mimicked the actual evaluation scenarios. 28x less data, same result, better generalization. The high-quality constitutional documents alone reduced misalignment more than 3x even though they were unrelated to the test.

This pairs with Anthropic's May 7 Natural Language Autoencoders post and the Petri donation post into a three-part research drop. Outside-the-model interpretability (NLA) plus inside-the-model principles training (this) plus open audit tooling (Petri) is Anthropic systematically publishing the safety case it would need if it shipped Claude as the default agent for billions of users.

The structural read: agent safety is now treated as a research stack, not a checklist. The harness-safety cluster (Mendral, Rosentic, Tilde, AgentTrust) is the runtime layer. The Anthropic May 7-8 trifecta is the model-internals layer. When both layers ship in the same week, "safe agents" stops being marketing and becomes engineering. Source: https://www.anthropic.com/research/teaching-claude-why
← Previous
Cloudflare Cuts 1,100 Jobs and Calls It the Agentic Era
Next β†’
DCI-Agent Says Drop the Vector DB and Let Agents Use Grep
← Back to all articles

Comments

Loading...
>_