An agent that fixes its own toolkit, no grader needed
This one is a real number, not a vibe: an agent improved its own pass rate on SWE-Bench Pro from 59% to 78% in a single optimization round, with no external grading at all. The paper is Retrospective Harness Optimization, and it's the cleanest entry yet in a trend I keep flagging: the harness is eating fine-tuning.
The method, RHO, is almost embarrassingly direct. Take the agent's past task trajectories. Pick a coreset of the diverse, hard ones. Re-run them in parallel. Let the agent judge its own rollouts by internal consistency: no labeled validation set, no human grader, no reward model. Then have it propose tweaks to its own toolkit and keep the ones it prefers. Run that loop once and you get a 19-point jump. Read that again: the agent looked at where it failed, rewrote the scaffolding around itself, and got dramatically better without anyone telling it the right answers.
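To make the loop concrete, here is a minimal Python sketch of one round as described above. It is a reading of that description, not the code from the repo: `Trajectory`, `select_coreset`, `consistency_score`, `rho_round`, and the `propose` and `run` callbacks are all hypothetical names. Difficulty is approximated by step count, "internal consistency" is approximated by how often repeated rollouts agree with each other, and rollouts run sequentially here rather than in parallel, purely for brevity.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List

Harness = dict  # hypothetical: any bag of tool/prompt/retry settings


@dataclass(frozen=True)
class Trajectory:
    task_id: str
    category: str  # coarse task type, used as the diversity signal (assumption)
    steps: int     # past step count, used as a difficulty proxy (assumption)


def select_coreset(history: List[Trajectory], k: int) -> List[Trajectory]:
    """Pick up to k tasks: hardest first, round-robin across categories."""
    by_cat: Dict[str, List[Trajectory]] = {}
    for t in sorted(history, key=lambda t: -t.steps):
        by_cat.setdefault(t.category, []).append(t)
    coreset: List[Trajectory] = []
    while len(coreset) < k and any(by_cat.values()):
        for cat in list(by_cat):
            if by_cat[cat] and len(coreset) < k:
                coreset.append(by_cat[cat].pop(0))
    return coreset


def consistency_score(answers: List[str]) -> float:
    """Fraction of rollouts that agree with the most common answer."""
    if not answers:
        return 0.0
    return Counter(answers).most_common(1)[0][1] / len(answers)


def rho_round(
    history: List[Trajectory],
    harness: Harness,
    propose: Callable[[Harness, List[Trajectory]], List[Harness]],
    run: Callable[[Harness, Trajectory, int], str],
    k: int = 4,
    n_rollouts: int = 5,
) -> Harness:
    """One round: re-run a coreset, score by self-agreement, keep the best harness."""
    coreset = select_coreset(history, k)
    if not coreset:
        return harness

    def score(h: Harness) -> float:
        per_task = [
            consistency_score([run(h, t, i) for i in range(n_rollouts)])
            for t in coreset
        ]
        return sum(per_task) / len(per_task)

    best, best_score = harness, score(harness)
    for candidate in propose(harness, coreset):
        s = score(candidate)
        if s > best_score:  # no labels involved: only the agent's own signal
            best, best_score = candidate, s
    return best
```

In a real setup, `run` would launch the agent on the task under the given harness and `propose` would ask the agent itself for toolkit edits. The point of the sketch is only that nothing in the loop needs a ground-truth answer.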
Why this keeps mattering: the industry's default assumption was that to make an agent better at a domain, you fine-tune the model. What papers like this keep showing is that you often don't need to touch the weights at all. You fix the tools, the prompts, the retry logic, the scaffolding the model runs inside. That's cheaper, faster, and you can do it after deployment on your own data.
There's a repo at github.com/wbopan/retro-harness and the paper is at arxiv.org/abs/2606.05922. If you run agents in production, the self-preference-over-rollouts trick is worth stealing.