April 5, 2026 · Open Source · Framework · Agents · Benchmark

AutoAgent Proves Agents Engineer Themselves Better Than We Do

Kevin Gu's team at ThirdLayer just dropped something that should make every prompt engineer uncomfortable. AutoAgent is an open-source framework where a meta-agent autonomously engineers other agents overnight. You point it at a benchmark, write a directive in program.md, and go to sleep. When you wake up, your agent is #1 on the leaderboard.

Not metaphorically. Literally. After 24 hours of self-optimization, AutoAgent hit 96.5% on SpreadsheetBench and 55.1% on TerminalBench. Both are #1 scores. Every other entry on those leaderboards was human-engineered. This one wasn't.

The architecture is almost offensively simple. One Python file (agent.py) that the meta-agent can edit. Docker containers for safety. Harbor's task format for evaluation. The meta-agent reads failure traces, hypothesizes improvements, modifies the harness, benchmarks, and loops. Thousands of parallel simulations overnight. It keeps what works, reverts what doesn't.
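The loop described above can be sketched as a simple hill-climbing optimizer. This is an illustrative toy, not AutoAgent's actual code: the parameter names, the `evaluate` stand-in, and the mutation scheme are all hypothetical, and the real system edits agent.py source and runs sandboxed benchmark suites rather than tweaking numeric knobs.

```python
import random

def evaluate(params):
    # Hypothetical stand-in for a benchmark run: score peaks when
    # both invented knobs approach 1.0. The real system would run
    # the modified agent against Harbor-format tasks in Docker.
    return 1.0 - abs(1.0 - params["temperature"]) - abs(1.0 - params["retry_budget"])

def optimize(steps=200, seed=0):
    rng = random.Random(seed)
    params = {"temperature": 0.0, "retry_budget": 0.0}
    best = evaluate(params)
    for _ in range(steps):
        key = rng.choice(list(params))
        candidate = dict(params)
        candidate[key] += rng.uniform(-0.2, 0.2)  # hypothesize an improvement
        score = evaluate(candidate)               # benchmark the change
        if score > best:                          # keep what works...
            params, best = candidate, score       # ...revert (ignore) what doesn't
    return params, best
```

The keep-or-revert step is the whole trick: because every mutation is scored before it's accepted, the harness can only monotonically improve on the benchmark, and running thousands of these loops in parallel overnight amounts to a large random search over harness designs.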

The real finding isn't the benchmark numbers; it's that agents see their own failure modes better than we do. They design action spaces differently than a human would: not worse, not the same, structurally different. This is the AI-improves-AI pattern applied to agent engineering itself, and it works embarrassingly well.

2,600 stars in three days. MIT license. https://github.com/kevinrgu/autoagent
