April 29, 2026 · Research · RL · Agents

TCOD — fixing the silent KL bug that breaks multi-turn agent training

A paper that landed on arXiv on April 27 names the bug that a lot of multi-turn agent-training pipelines have been hitting without properly diagnosing. A CUHK + Alibaba + KAUST collaboration.

The bug is trajectory-level KL instability. On-policy distillation—teacher model demonstrates, student model imitates while interacting with the environment—is the standard recipe for training small agents. In multi-turn settings, the authors document a pathology: KL divergence keeps rising while success rate drops. Inter-turn error compounding pushes the student outside the teacher's effective guidance distribution, and additional training makes things worse instead of better.
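A quick way to see whether you're in this regime is to measure KL per turn rather than per trajectory. Below is a minimal diagnostic sketch in PyTorch, assuming HF-style models whose forward pass returns an object with `.logits`; the names `student`, `teacher`, `token_ids`, and `turn_boundaries` are illustrative, not the paper's code. If the per-turn values climb with turn index while task success falls, you're looking at the compounding the authors describe.

```python
# Minimal sketch (PyTorch), assuming HF-style models that expose `.logits`.
# All argument names are illustrative placeholders, not the paper's code.
import torch
import torch.nn.functional as F

@torch.no_grad()
def per_turn_reverse_kl(student, teacher, token_ids, turn_boundaries):
    """Mean reverse KL(student || teacher) within each turn of one rollout.

    token_ids:       (1, T) tokens of a multi-turn trajectory sampled
                     from the student.
    turn_boundaries: list of (start, end) index pairs, one per turn.
    """
    s_logp = F.log_softmax(student(token_ids).logits, dim=-1)  # (1, T, V)
    t_logp = F.log_softmax(teacher(token_ids).logits, dim=-1)  # (1, T, V)
    # Reverse KL at each position: sum_v p_s(v) * (log p_s(v) - log p_t(v))
    kl = (s_logp.exp() * (s_logp - t_logp)).sum(-1).squeeze(0)  # (T,)
    return [kl[start:end].mean().item() for start, end in turn_boundaries]
```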

The fix is basic enough that you wonder why nobody spelled it out before: curriculum scheduling on trajectory depth. Start the student on short trajectories (3 steps), then expand to longer ones. Let the model first learn to be correct inside the teacher's comfort zone, then learn to carry past the point where the teacher gives up.
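Here's a minimal sketch of what that schedule can look like in a training loop, under assumptions: the start depth of 3 comes from the post, but the linear ramp, the final cap of 30 turns, and every helper name (`collect_rollout`, `distill_kl_loss`, `env`) are hypothetical placeholders, not the paper's actual schedule or API.

```python
# Hypothetical depth-curriculum sketch; only the start depth of 3 comes
# from the post. Ramp shape, final cap, and helper names are assumptions.
def max_depth_at(step, total_steps, start_depth=3, final_depth=30):
    """Linearly expand the trajectory-depth cap over training."""
    frac = min(step / max(total_steps, 1), 1.0)
    return int(start_depth + frac * (final_depth - start_depth))

def collect_rollout(env, student, depth_cap):
    """Let the student act in the environment for at most depth_cap turns."""
    obs, trajectory = env.reset(), []
    for _ in range(depth_cap):
        action = student.act(obs)        # student samples its own action
        obs, done = env.step(action)     # simplified env interface
        trajectory.append((obs, action))
        if done:
            break
    return trajectory

# The "few lines in the training loop":
# for step in range(total_steps):
#     cap = max_depth_at(step, total_steps)
#     traj = collect_rollout(env, student, cap)
#     distill_kl_loss(student, teacher, traj).backward()
#     optimizer.step(); optimizer.zero_grad()
```

The reason to cap rollout depth itself, rather than just down-weighting late turns in the loss, is that early training then never visits states outside the distribution the teacher can usefully grade.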

Tested on ALFWorld, WebShop, and ScienceWorld, with four student-teacher pairs. Up to 18-point gains over vanilla on-policy distillation. The kicker: students outperform teachers on some tasks and generalize to ones where the teacher outright fails.

This is the kind of paper the agent-pretraining wave should have produced six months ago. The space has been working through corner cases (Tool Attention, SkillSynth, RecursiveMAS), and TCOD nails the specific failure mode where naive KL minimization actively damages a long-horizon agent. Anyone running a production agent-training pipeline should try the curriculum-schedule patch this week; the fix is a few lines in the training loop.

arXiv: https://arxiv.org/abs/2604.24005