July 3, 2026 · Research · RL · Agents

RL Post-Training Might Only Need One Layer

Here's a result that should make every lab's finance team sit up. A new paper, Is One Layer Enough? (arXiv 2607.01232, on HN's front page today), finds that training a single transformer layer can recover most of the gains of full-parameter RL training — and sometimes beat it. The authors introduce layer contribution, a measure of how much of the full RL improvement you get back by training one layer in isolation, and they test it properly: seven models across Qwen3 and Qwen2.5, three RL algorithms (GRPO, GiGPO, Dr.GRPO), on math reasoning, code generation and agentic decision-making.
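To make the metric concrete, here is a minimal sketch. It assumes layer contribution is simply the fraction of the full-parameter gain that single-layer training recovers; the paper's exact formula may differ. The function name and all scores are hypothetical, invented for illustration.

```python
def layer_contribution(base_score: float, full_score: float, single_layer_score: float) -> float:
    """Fraction of the full-RL improvement recovered by training one layer.

    Assumed definition (not taken from the paper):
        (single_layer - base) / (full - base)
    1.0 means the single layer matches full-parameter RL; values above 1.0
    mean it beats it.
    """
    full_gain = full_score - base_score
    if full_gain == 0:
        raise ValueError("full-parameter RL produced no gain; ratio is undefined")
    return (single_layer_score - base_score) / full_gain


# Made-up benchmark scores, keyed by the index of the one layer that was trained.
scores_by_layer = {0: 41.0, 12: 58.0, 27: 42.5}
base, full = 40.0, 60.0

contributions = {
    layer: layer_contribution(base, full, score)
    for layer, score in scores_by_layer.items()
}
best_layer = max(contributions, key=contributions.get)
```

With these invented numbers the middle layer recovers most of the gain while the first and last layers recover very little, which is the shape of result the paper reports.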

The pattern is consistent: the layers that matter concentrate in the middle of the stack, while layers near the input and output contribute almost nothing. If so, full-parameter RL post-training, built on the assumption that everything needs to be touched, has been quietly wasting most of its compute. And if one mid-stack layer carries the improvement, the cost of turning a base model into an agent collapses, and the door opens to cheap per-task RL the way LoRA opened cheap per-task fine-tuning.
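In practice, training one layer in isolation means freezing every parameter outside a single block. Here is a rough sketch of the selection step. It assumes parameters follow a Hugging Face-style `model.layers.<index>.` naming scheme; the helper names and the toy parameter sizes are hypothetical and not from the paper.

```python
import re

# Assumption: decoder blocks are named "model.layers.<index>.<...>", as in many
# Hugging Face-style checkpoints. Adjust the pattern for other naming schemes.
LAYER_PATTERN = re.compile(r"^model\.layers\.(\d+)\.")


def trainable_names(param_names, layer_index):
    """Return the parameter names that belong to a single transformer layer."""
    selected = []
    for name in param_names:
        match = LAYER_PATTERN.match(name)
        if match and int(match.group(1)) == layer_index:
            selected.append(name)
    return selected


def trainable_fraction(param_sizes, layer_index):
    """Share of all parameters that would receive gradients."""
    chosen = set(trainable_names(param_sizes, layer_index))
    total = sum(param_sizes.values())
    return sum(size for name, size in param_sizes.items() if name in chosen) / total


# Toy 4-layer model with invented parameter counts, for illustration only.
toy_sizes = {"model.embed_tokens.weight": 1000, "lm_head.weight": 1000}
for i in range(4):
    toy_sizes[f"model.layers.{i}.self_attn.weight"] = 400
    toy_sizes[f"model.layers.{i}.mlp.weight"] = 600

middle = trainable_names(toy_sizes, layer_index=2)
fraction = trainable_fraction(toy_sizes, layer_index=2)
```

With a PyTorch model, the same name list could drive `requires_grad` on each entry of `named_parameters()`, so only the chosen layer is updated. In the toy model above, one layer is a sixth of the parameters; in a real model with dozens of layers the share would be far smaller.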

The finding also says something about where agency lives in these networks. RL isn't rewiring the whole model into an agent; it's adjusting a narrow band in the middle. That's consistent with a growing body of work suggesting agent capability is more localized and more portable than the full-parameter orthodoxy assumed — the same instinct behind the skills-into-weights research line (LatentSkill, OPID).

The obvious next question: does this hold at frontier scale, and does single-layer RL stack with the harness-optimization results that keep landing (Retrospective Harness Optimization, SIA)? If both hold, the recipe for a capable agent gets very cheap, very fast.

https://arxiv.org/abs/2607.01232