T²PO Stops Multi-Turn Agents from Locking In
T²PO landed on arXiv (2605.02178) as an ICML 2026 Spotlight, the same week as AEM (Adaptive Entropy Modulation), which I covered yesterday. Two answers to the same problem from two different teams. The problem: in multi-turn agentic RL, the policy commits early and stops exploring on later turns, exactly when it should be exploring most. Entropy collapses, the agent locks in, training stalls.
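To make the failure mode concrete, here's a minimal diagnostic sketch (mine, not from either paper) assuming you collect per-turn logits during rollout. It measures mean token-level policy entropy per turn; in a collapsing run the numbers drop sharply after the first turn or two and stay flat.

```python
import torch

def per_turn_entropy(turn_logits: list[torch.Tensor]) -> list[float]:
    """Mean token-level policy entropy for each turn of one sampled
    trajectory. turn_logits holds one [num_tokens, vocab_size] tensor
    per turn, collected during rollout."""
    entropies = []
    for logits in turn_logits:
        log_probs = torch.log_softmax(logits, dim=-1)
        # H = -sum(p * log p) over the vocab, per token, then averaged.
        token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)
        entropies.append(token_entropy.mean().item())
    return entropies  # locked-in agent: high on turn 1, near-zero after
```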
T²PO's answer is uncertainty-guided control at two granularities. Token level: track marginal uncertainty as the policy samples; once the change in uncertainty drops below a threshold, trigger an explicit thinking intervention. Turn level: identify low-quality turns by their uncertainty and dynamically resample them instead of wasting training budget on them. Validated on WebShop, ALFWorld, and Search QA. AEM's answer was per-turn entropy modulation: less exploration on earlier turns, more on later ones. Different mechanism, same target.
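The paper's actual estimators and thresholds are its own; below is a rough sketch of the two control loops as I read the summary, with hypothetical names and constants throughout.

```python
# Hypothetical constants; T²PO tunes its own thresholds.
UNCERTAINTY_DELTA_MIN = 0.05   # "uncertainty stopped moving" cutoff
RESAMPLE_BUDGET = 2            # max turns to resample per trajectory

def should_trigger_thinking(uncertainty_history: list[float]) -> bool:
    """Token level: once the change in marginal uncertainty between
    successive estimates drops below a threshold, the policy has stopped
    updating its beliefs and is about to lock in, so inject an explicit
    thinking step."""
    if len(uncertainty_history) < 2:
        return False
    delta = abs(uncertainty_history[-1] - uncertainty_history[-2])
    return delta < UNCERTAINTY_DELTA_MIN

def turns_to_resample(turn_quality: list[float]) -> list[int]:
    """Turn level: rank turns by an uncertainty-derived quality score
    and resample only the worst ones, rather than spending rollout
    budget uniformly across the trajectory."""
    ranked = sorted(range(len(turn_quality)), key=lambda i: turn_quality[i])
    return ranked[:RESAMPLE_BUDGET]
```

Everything here (the score definitions, where the intervention text gets injected, how resampled turns are spliced back into the trajectory) is underspecified by the abstract, so read the paper before copying any of it.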
Two ICML 2026 spotlights on the same problem in the same week: that's the leading indicator. Multi-turn agentic RL is the bottleneck for agent capability past horizon-3, where basic tool calls work but multi-step plans don't. AEM, T²PO, plus Exploration Hacking (arXiv 2604.21456, May 3) make three independent attempts on the same wall in eight days. The training-time problem of agents-that-stop-exploring is a real research category now, not a side note.
The practical read. Anyone training agent-RL pipelines is going to be reading both papers next week. The choice between entropy-modulation and uncertainty-guidance will probably end up like the AdamW vs Adafactor decision — both work, you pick one based on your stack. For Naive.AI's Agent Pretrain direction (scaling tool-use data 100x into pre-training), the long-horizon RL fine-tune step will need one of these two. The fact that ICML spotlit both says the field thinks this is the bottleneck.
Paper: arxiv.org/abs/2605.02178