May 4, 2026 · Research · RL · Agents

AEM: The RL Trick Multi-Turn Agents Have Been Missing

arXiv 2605.00425: AEM, Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning, from Haotian Zhao, Yuxin Zhang, Songlin Zhou, Stephen S.-T. Yau, Wenyu Zhang, and team. They tackle one of the messy real-world problems in agent training: when you run RL on a multi-turn agent, policy entropy collapses too fast. The agent locks into a single strategy by turn three when it should still be exploring at turn seven.
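To make the failure mode concrete, here's a minimal diagnostic sketch (mine, not the paper's) for spotting it: bucket the policy's token-level entropy by turn index across a batch of rollouts. Collapse shows up as the per-turn averages, especially for later turns, flattening toward zero early in training. The function name and tensor layout are assumptions for illustration.

```python
# Hypothetical diagnostic (not from the paper): mean policy entropy per turn.
from collections import defaultdict

import torch
import torch.nn.functional as F

def entropy_by_turn(logits_per_token, turn_ids):
    """logits_per_token: list of [vocab]-shaped tensors, one per generated token.
    turn_ids: list of ints giving the turn each token was generated on."""
    buckets = defaultdict(list)
    for logits, turn in zip(logits_per_token, turn_ids):
        logp = F.log_softmax(logits, dim=-1)
        h = -(logp.exp() * logp).sum()  # Shannon entropy in nats
        buckets[turn].append(h.item())
    return {turn: sum(vals) / len(vals) for turn, vals in buckets.items()}
```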

The fix is simple to describe and somewhat fiddly to tune. They modulate the entropy bonus per turn: earlier turns get less exploration pressure (commit to a plan), later turns get more (recover when the plan fails). The empirical result is that long-horizon agentic tasks where exploration matters get measurably better, with meaningful reported gains on ReAct-style multi-step QA and tool-using agent benchmarks.
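For a feel of the mechanics, here's a minimal sketch of a per-turn entropy bonus folded into a PPO-style objective. Everything here is illustrative rather than the paper's method: the linear ramp, the beta_min/beta_max names, and the PPO framing are my assumptions; AEM's actual schedule is adaptive, not a fixed ramp.

```python
# Sketch only: per-turn entropy coefficient in a PPO-style loss.
# The linear ramp and parameter names are assumptions, not the paper's schedule.
import torch

def turn_entropy_coef(turn_ids, max_turns, beta_min=1e-3, beta_max=1e-2):
    """Low entropy bonus on early turns (commit), high on late turns (recover)."""
    frac = turn_ids.float() / max(max_turns - 1, 1)
    return beta_min + (beta_max - beta_min) * frac  # shape: [num_tokens]

def ppo_loss_with_turn_entropy(ratio, advantages, entropies, turn_ids,
                               max_turns, clip_eps=0.2):
    """All tensor args are per-token, shape [num_tokens]."""
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    beta = turn_entropy_coef(turn_ids, max_turns)
    entropy_bonus = (beta * entropies).mean()  # turn-weighted exploration pressure
    return policy_loss - entropy_bonus
```

The shape is the whole point: a token generated on turn 0 gets roughly beta_min, one on the final turn gets beta_max, so the gradient pressure toward higher-entropy behavior concentrates where the initial plan is most likely to have gone sideways.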

Why this matters now: every agent pretraining or post-training effort that has hit a wall in the last six months has run into some version of the entropy collapse problem. Standard Intelligence's raw-video bet, the Frontier Coding Agents AlphaZero result, Exploration Hacking: all of them touch it. AEM is the first published technique specifically engineered for multi-turn entropy management rather than borrowed from single-turn RL.

It's not the final answer. But it's a clean methodological data point in the agent-RL cluster, and it's the kind of paper that will quietly show up inside everyone's training pipeline within a quarter.

https://arxiv.org/abs/2605.00425