RAO Trains Agents to Spawn Smaller Copies of Themselves
A CMU group dropped a paper on May 7 with a sharp idea: train agents to recursively delegate sub-tasks to fresh copies of themselves. Recursive Agent Optimization, by Apurva Gandhi, Satyaki Chakraborty, Xiangjun Wang, Aviral Kumar, and Graham Neubig. The Kumar–Neubig pairing is the lineage you watch for in agent RL.
The intuition is what every harness engineer has tried to hand-build with subagents and tool calls; RAO bakes the recursion into RL. The agent learns when to delegate, what to delegate, how to communicate the result back, and how to compose the partial answers into a final one. No new architecture, no extended context windows, no special inference tricks: just an RL training signal that rewards effective decomposition.
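To make the loop concrete, here's a minimal sketch of what a recursive-delegation rollout could look like. The code isn't out yet, so every name below (`Agent`, `Step`, `execute_tool`, the depth cap) is a hypothetical illustration of the control flow the model has to learn, not the paper's actual interface.

```python
# A minimal sketch of a recursive-delegation rollout, assuming a policy
# that emits one of three step types. All names here are hypothetical;
# they illustrate the control flow, not RAO's real API.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    kind: str     # "delegate" | "act" | "answer"
    payload: str  # sub-task description, tool call, or final answer

def execute_tool(call: str) -> str:
    """Stub for ordinary tool execution (search, code run, etc.)."""
    return f"tool-output({call})"

class Agent:
    def __init__(self, policy: Callable[[List[str]], Step], max_depth: int = 3):
        self.policy = policy        # one trained model, shared by every copy
        self.max_depth = max_depth  # hard cap to prevent runaway recursion

    def solve(self, task: str, depth: int = 0) -> str:
        context = [task]  # each copy starts from a fresh, short context
        while True:
            step = self.policy(context)
            if step.kind == "answer":
                return step.payload
            elif step.kind == "delegate" and depth < self.max_depth:
                # Spawn a fresh copy of ourselves on the sub-task. The copy
                # sees only the sub-task, not the parent's accumulated state,
                # which is what keeps every individual context small.
                child = Agent(self.policy, self.max_depth)
                context.append(f"sub-result: {child.solve(step.payload, depth + 1)}")
            else:
                # "act", or a delegation blocked by the depth cap: plain tool use
                context.append(execute_tool(step.payload))
```

Note where the learning problem lives in this sketch: entirely inside `policy`. Delegate/act/answer is just another action in the model's output space, so a standard RL loop on final-task reward can, in principle, credit good decompositions, which is consistent with the paper's no-new-architecture framing.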
Why this matters: every long-horizon agent today hits the same wall. The context fills up with intermediate state and the agent loses coherence. Existing answers are either bigger context windows (expensive, brittle) or hand-built multi-agent frameworks (rigid, don't generalize). RAO is the third answer: let the model itself learn divide-and-conquer the way a senior engineer learns it. Inference-time scaling without architectural changes is exactly the property the agent training community has been chasing for a year.
Editorial spine: pair this with SkillOS (skill curator + frozen executor, also May 7) and you start seeing the shape of the next agent training paradigm. Skills compose horizontally, recursive delegation composes vertically. Code isn't released yet, but with this author list it'll land. Watch this paper through the next two HuggingFace daily cycles — if it crosses 30 upvotes the recursive-delegation wedge becomes the editorial moment for the week.
https://arxiv.org/abs/2605.06639