June 1, 2026 · Research · Framework

Scaling the Harness, Not the Model

UC Berkeley's Shangding Gu just put out a position paper that names the next bottleneck in agentic AI directly: stop scaling the model, start scaling the harness. The harness is the structured execution layer wrapping a foundation model: memory substrate, context constructor, skill router, orchestration loop, verification, and governance. He argues that agent performance comes from how these components compose, not from raw model IQ.
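To make "how these compose" concrete, here is a minimal sketch of a harness wired around a stand-in model. Every name in it (`Harness`, `build_context`, `route`, `verify`, the `name: argument` routing format) is an assumption made for illustration, not the paper's design or any library's API, and governance is left out to keep it short.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types: a "model" and a "skill" are just text-to-text callables here.
Model = Callable[[str], str]
Skill = Callable[[str], str]


@dataclass
class Harness:
    model: Model
    skills: Dict[str, Skill]
    memory: List[str] = field(default_factory=list)
    max_context_items: int = 3

    def build_context(self, task: str) -> str:
        # Context constructor: only the most recent memory items reach the model.
        recent = self.memory[-self.max_context_items:]
        return "\n".join(recent + [f"TASK: {task}"])

    def route(self, output: str) -> Tuple[Optional[Skill], str]:
        # Skill router: assumes the model answers "skill_name: argument".
        name, _, arg = output.partition(":")
        return self.skills.get(name.strip()), arg.strip()

    def verify(self, result: str) -> bool:
        # Verification: a trivial non-empty check standing in for a real verifier.
        return bool(result.strip())

    def run(self, task: str, max_steps: int = 3) -> str:
        # Orchestration loop: build context, call model, route, verify, remember.
        for _ in range(max_steps):
            output = self.model(self.build_context(task))
            skill, arg = self.route(output)
            result = skill(arg) if skill else output
            self.memory.append(result)
            if self.verify(result):
                return result
        return ""


def fake_model(context: str) -> str:
    # Stand-in for a foundation model call; always asks for the "upper" skill.
    return "upper: hello"


harness = Harness(model=fake_model, skills={"upper": str.upper})
print(harness.run("greet"))  # HELLO
```

The model is the same callable in every run; what changes behavior is which memory items reach the context, how output is routed, and what the verifier accepts. Those are the knobs the paper wants treated as research objects.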

The paper identifies three bottlenecks worth obsessing over: context governance (who writes to the context window and what gets pruned), trustworthy memory (provenance, decay, hygiene), and dynamic skill routing (picking the right tool or subagent on the fly). Each one currently gets treated as an implementation detail. The argument is they should be first-class research objects with their own benchmarks, the way pre-training got its own.
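As one illustration of what "provenance, decay, hygiene" could mean in practice, the sketch below scores memory records by age and by who wrote them, then prunes the low scorers. The half-life, the trusted-source set, the 0.5 trust discount, and the threshold are all assumptions chosen for the example; the paper does not specify a scoring rule.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class MemoryRecord:
    text: str
    source: str        # provenance: which component or actor wrote this entry
    written_at: float  # timestamp in seconds


# Hypothetical policy: entries from these writers are trusted at full weight.
TRUSTED_SOURCES: FrozenSet[str] = frozenset({"user", "verifier"})


def score(record: MemoryRecord, now: float, half_life: float = 3600.0) -> float:
    # Decay: the score halves every `half_life` seconds.
    age = max(0.0, now - record.written_at)
    decay = 0.5 ** (age / half_life)
    # Provenance: untrusted writers get an (assumed) 0.5 discount.
    trust = 1.0 if record.source in TRUSTED_SOURCES else 0.5
    return decay * trust


def prune(records: List[MemoryRecord], now: float, threshold: float = 0.3) -> List[MemoryRecord]:
    # Hygiene: drop anything whose score has fallen below the threshold.
    return [r for r in records if score(r, now) >= threshold]


records = [
    MemoryRecord("user prefers tabs", source="user", written_at=0.0),
    MemoryRecord("repo uses spaces", source="web_scrape", written_at=0.0),
]
kept = prune(records, now=3600.0)
print([r.text for r in kept])  # ['user prefers tabs']
```

After one half-life the user-written record scores 0.5 and survives, while the scraped record scores 0.25 and is pruned. A real policy would need to be tuned and benchmarked, which is the paper's point.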

This crystallizes something a lot of teams have been feeling but not naming. Frontier model gains compress every cycle, but harness gains are wide open. Compound Engineering, Anthropic's skills system, Claude Code plugins: nearly everything trending on GitHub right now lives at the harness layer. If you're building on top of Claude or GPT and trying to find your edge, this paper is the closest thing to a research agenda you'll get.

The call for harness-level benchmarks is the actionable bit. Today's agent benchmarks (SWE-bench, GAIA, etc.) reward end-to-end task success, which is exactly the wrong metric if you want to learn what to fix in the harness: a pass/fail score does not tell you whether the context, the memory, or the routing was at fault. The paper sketches what trajectory-quality, memory-hygiene, and context-efficiency benchmarks could look like. Whoever ships that benchmark first owns the conversation.
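A rough sketch of why end-to-end success hides harness differences: the two hypothetical runs below are both assumed to have solved the task, yet simple trajectory-level numbers separate them. The metric names and definitions (`redundant_call_rate`, `mean_context_tokens`) are invented here for illustration and are not the benchmarks the paper proposes.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    tool: str
    args: str
    context_tokens: int  # size of the context sent to the model at this step


def redundant_call_rate(trajectory: List[Step]) -> float:
    # Fraction of steps that repeat an earlier (tool, args) pair exactly.
    seen = set()
    repeats = 0
    for step in trajectory:
        key = (step.tool, step.args)
        if key in seen:
            repeats += 1
        seen.add(key)
    return repeats / len(trajectory) if trajectory else 0.0


def mean_context_tokens(trajectory: List[Step]) -> float:
    # Average context size per step, a crude proxy for context efficiency.
    if not trajectory:
        return 0.0
    return sum(s.context_tokens for s in trajectory) / len(trajectory)


# Two made-up runs that both end in task success.
run_a = [
    Step("search", "q", 1000),
    Step("search", "q", 1200),
    Step("edit", "f", 1400),
    Step("search", "q", 1600),
]
run_b = [Step("search", "q", 800), Step("edit", "f", 900)]

print(redundant_call_rate(run_a), mean_context_tokens(run_a))  # 0.5 1300.0
print(redundant_call_rate(run_b), mean_context_tokens(run_b))  # 0.0 850.0
```

On a success-only leaderboard these runs tie. On the trajectory numbers, the first one repeats half its calls and carries a larger context, which points at the router and the context constructor rather than the model.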

Paper: https://arxiv.org/abs/2605.26112
← Previous
COLLEAGUE.SKILL: Distill a Person Into an Agent Skill
Next β†’
Impeccable: A Design Skill for Your AI Harness
← Back to all articles

Comments

Loading...
>_