Role-Agent: One LLM Plays Both Agent and World
Role-Agent (arXiv 2606.10917) is the top agent paper on Hugging Face today at 74 upvotes, and the trick is elegant: one LLM, two roles, no human labels. In the World-In-Agent role, the model predicts what the environment state will be after each action it takes. How well the prediction matches reality becomes a process reward, pushing the agent toward environment-aware reasoning instead of blind action-chaining. In the Agent-In-World role, the model reads its own failed trajectories, diagnoses the failure mode, and retrieves training tasks with similar failure patterns, rebuilding its own curriculum around what it is bad at. The reported average gain is over 4% across benchmarks against strong baselines.
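To make the World-In-Agent idea concrete, here is a minimal sketch of a prediction-versus-reality process reward. It is not the paper's scoring function: the dict-based state representation, the exact-match comparison, and the names `state_match_reward` and `trajectory_process_rewards` are all assumptions made for illustration.

```python
from typing import Mapping, Sequence, Tuple

State = Mapping[str, object]


def state_match_reward(predicted: State, observed: State) -> float:
    """Fraction of observed state fields that the prediction got exactly right.

    Assumption: states are flat key/value mappings and exact equality is a
    reasonable match criterion. A real system might use a softer similarity.
    """
    if not observed:
        # Nothing to get right: reward an empty prediction, penalize a non-empty one.
        return 1.0 if not predicted else 0.0
    correct = sum(
        1
        for key, value in observed.items()
        if key in predicted and predicted[key] == value
    )
    return correct / len(observed)


def trajectory_process_rewards(
    steps: Sequence[Tuple[State, State]],
) -> list[float]:
    """One reward per step, given (predicted_state, observed_state) pairs."""
    return [state_match_reward(predicted, observed) for predicted, observed in steps]


if __name__ == "__main__":
    steps = [
        ({"door": "open", "has_key": False}, {"door": "open", "has_key": True}),
        ({"door": "open", "has_key": True}, {"door": "open", "has_key": True}),
    ]
    print(trajectory_process_rewards(steps))
```

The point of the sketch is that the reward signal comes from comparing two things the training loop already has, the model's own prediction and the environment's actual response, so no extra reward model is involved.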
Two things stand out. First, the process reward is nearly free: no separate reward model, no separate world model to train, just prediction-versus-reality on the same LLM. The reward-engineering bottleneck in agent RL keeps getting attacked from new angles, and this is one of the cheapest. Second, the failure-driven curriculum is basically deliberate practice. The agent does not grind random tasks; it drills its weaknesses.
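The failure-driven curriculum can be sketched in a similar spirit. This is a toy illustration, not the paper's retrieval method: it assumes each failed trajectory has already been diagnosed into a list of failure-mode labels and that each candidate task carries hypothetical `modes` tags. The label names and the functions `failure_profile` and `build_curriculum` are made up for the example.

```python
from collections import Counter
from typing import Iterable, Sequence


def failure_profile(failed_diagnoses: Iterable[Sequence[str]]) -> Counter:
    """Count how often each diagnosed failure mode appears across failed runs."""
    return Counter(mode for modes in failed_diagnoses for mode in modes)


def build_curriculum(task_pool: Sequence[dict], profile: Counter, k: int) -> list[str]:
    """Return up to k task ids whose failure-mode tags overlap most with the profile.

    Assumption: each task is a dict with an "id" and a list of "modes" tags.
    Tasks that match none of the agent's observed failure modes are skipped.
    """

    def score(task: dict) -> int:
        # A missing mode counts as 0 because Counter returns 0 for unknown keys.
        return sum(profile[mode] for mode in task["modes"])

    ranked = sorted(task_pool, key=score, reverse=True)
    return [task["id"] for task in ranked[:k] if score(task) > 0]


if __name__ == "__main__":
    diagnoses = [
        ["wrong_tool"],
        ["wrong_tool", "lost_state"],
        ["lost_state"],
        ["wrong_tool"],
    ]
    pool = [
        {"id": "a", "modes": ["wrong_tool"]},
        {"id": "b", "modes": ["formatting"]},
        {"id": "c", "modes": ["wrong_tool", "lost_state"]},
        {"id": "d", "modes": ["lost_state"]},
    ]
    print(build_curriculum(pool, failure_profile(diagnoses), k=2))
```

Weighting tasks by how often the matching failure mode occurred is one simple way to express "drill the weaknesses"; an embedding-based similarity search over failure descriptions would be another plausible choice.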
It slots into the self-improvement wave (MLEvolve, Retrospective Harness Optimization, SIA, all within the past week) but at the training-loop layer rather than the harness layer. The loop where agents grade themselves, diagnose themselves, and assign themselves homework is assembling piece by piece.
https://arxiv.org/abs/2606.10917