July 6, 2026 | Research, Benchmark, RL

EvoPolicyGym asks the real question: not can the agent solve it, but can it get better

Most benchmarks score the final answer. EvoPolicyGym scores the climb. It hands an agent a chunk of Python policy code, 16 environments, and a fixed 128-episode interaction budget, hides the validation and test scores, and then watches how the agent diagnoses its own failures and rewrites its control code over time. It even splits the scoring in two: structural synthesis (inventing a genuinely new control mechanism) versus parametric tuning (just nudging the constants inside an existing structure). The work comes out of USTC, CUHK, University of Macau, and Tsinghua: arXiv 2607.02440.
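To make the structural-versus-parametric split concrete, here is a minimal toy sketch. It is not EvoPolicyGym's API or any of its environments: the point-mass task, `BudgetedGym`, `make_p_policy`, and `make_pd_policy` are hypothetical names invented for illustration, and only the 128-episode budget comes from the description above. The hidden validation and test scores are not modeled.

```python
from typing import Callable

EPISODE_BUDGET = 128  # the one number taken from the article; the rest is a toy assumption

Policy = Callable[[float, float], float]  # (position, velocity) -> force


def make_p_policy(kp: float) -> Policy:
    """Proportional-only controller: the 'existing structure'."""
    return lambda pos, vel: -kp * pos


def make_pd_policy(kp: float, kd: float) -> Policy:
    """Structural edit: adds a damping term that the P controller lacks."""
    return lambda pos, vel: -kp * pos - kd * vel


def run_episode(policy: Policy, steps: int = 200, dt: float = 0.05) -> float:
    """Toy frictionless point mass starting at pos=1; score is negative mean |pos|."""
    pos, vel, cost = 1.0, 0.0, 0.0
    for _ in range(steps):
        force = policy(pos, vel)
        vel += force * dt
        pos += vel * dt
        cost += abs(pos)
    return -cost / steps


class BudgetedGym:
    """Hypothetical wrapper that charges one episode per evaluation."""

    def __init__(self, budget: int = EPISODE_BUDGET) -> None:
        self.remaining = budget

    def evaluate(self, policy: Policy) -> float:
        if self.remaining <= 0:
            raise RuntimeError("episode budget exhausted")
        self.remaining -= 1
        return run_episode(policy)


gym = BudgetedGym()

# Parametric tuning: sweep the gain inside the same P-only structure.
p_scores = {kp: gym.evaluate(make_p_policy(kp)) for kp in (0.5, 1.0, 2.0, 4.0, 8.0)}
best_p = max(p_scores.values())

# Structural synthesis: one edit that introduces damping.
pd_score = gym.evaluate(make_pd_policy(kp=4.0, kd=4.0))

print(f"best P-only score: {best_p:.3f}")
print(f"PD score:          {pd_score:.3f}")
print(f"episodes left:     {gym.remaining}")
```

In this toy, no gain makes the undamped controller settle, so five episodes of tuning plateau while a single structural change jumps past them. That is the kind of difference a split score is meant to surface.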

The results are worth reading. GPT-4.5 tops the benchmark at 0.891, with Claude Opus 4.7 second at 0.750 and notably strong on visual navigation. The gap between models widens most on the synthesis-heavy tasks, the ones that require inventing a control structure rather than tuning one. And the agents that win don't retry at random; they convert visible failure evidence into targeted changes. Diagnosis, not thrashing.

Why it matters: agents that improve through deployment, not through one giant offline training run, are the whole self-improvement thesis, the same thread as SIA and MLEvolve and the harness-eats-fine-tuning papers piling up this quarter. A single final score hides whether an agent explores efficiently or just flails around inside its budget. This benchmark makes the improvement loop itself the thing being measured. Knowing how to get better under a fixed budget is exactly the skill that scaling parameters doesn't automatically buy you.
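A quick illustration of why a single final score can mislead. The numbers below are made up for two hypothetical agents and are not benchmark data, and `mean_best_so_far` is an illustrative summary statistic, not EvoPolicyGym's actual scoring rule.

```python
from itertools import accumulate


def best_so_far(scores: list[float]) -> list[float]:
    """Running maximum: the best score reached after each attempt."""
    return list(accumulate(scores, max))


def mean_best_so_far(scores: list[float]) -> float:
    """Average height of the improvement curve across the whole budget."""
    curve = best_so_far(scores)
    return sum(curve) / len(curve)


# Made-up per-attempt scores for two hypothetical agents.
diagnoser = [0.2, 0.5, 0.7, 0.8, 0.8, 0.8, 0.8, 0.8]  # steady, targeted gains
thrasher = [0.2, 0.1, 0.3, 0.2, 0.1, 0.3, 0.2, 0.8]   # flails, then gets lucky

for name, scores in (("diagnoser", diagnoser), ("thrasher", thrasher)):
    print(f"{name}: final best={max(scores):.2f}, "
          f"mean best-so-far={mean_best_so_far(scores):.3f}")
```

Both agents end at the same best score, but the curve-level summary separates the one that improved early and steadily from the one that stumbled onto its result on the last attempt.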

