SDAR Patches Where GRPO Breaks for Agents
SDAR landed on arXiv May 14 (2605.15155), 58 upvotes on HuggingFace today. Zhejiang University plus Meituan plus Tsinghua. The problem the paper attacks is specific and well-known to anyone training multi-turn agents: GRPO gives you only a trajectory-level reward, which is too coarse for long-horizon tasks; on-policy self-distillation gives you dense token-level signal but destabilizes under multi-turn compounding errors.
Their fix is a sigmoid gate. Treat the self-distillation signal as a gated auxiliary objective, with RL as the primary backbone: strengthen distillation on teacher-endorsed positive-gap tokens, soften the gradient on negative-gap tokens the teacher rejects (which in agent settings often come from imperfect skill retrieval, not from the student being wrong). On Qwen2.5 and Qwen3 across ALFWorld, WebShop, and Search-QA, SDAR beats GRPO in accuracy by +9.4% on ALFWorld, +10.2% on WebShop, and +7.0% on Search-QA.
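The abstract doesn't spell out the exact gate, but the mechanism is simple enough to sketch. Here's a minimal PyTorch sketch under my own assumptions: the gate is a sigmoid of the teacher-student log-prob gap on the sampled token, and `gated_distill_loss`, `tau`, and `lam` are hypothetical names and hyperparameters, not anything from the paper's code.

```python
import torch
import torch.nn.functional as F

def gated_distill_loss(student_logits, teacher_logits, tokens, tau=1.0):
    """Sigmoid-gated per-token self-distillation (a sketch, not SDAR's code).

    student_logits, teacher_logits: (batch, seq, vocab); tokens: (batch, seq).
    """
    student_lp = F.log_softmax(student_logits, dim=-1)
    teacher_lp = F.log_softmax(teacher_logits, dim=-1).detach()  # teacher frozen

    # Log-prob gap on the sampled tokens: gap > 0 means the teacher
    # endorses the token more strongly than the student does.
    idx = tokens.unsqueeze(-1)
    gap = (teacher_lp.gather(-1, idx) - student_lp.gather(-1, idx)).squeeze(-1).detach()

    # Sigmoid gate: near 1 on teacher-endorsed positive-gap tokens,
    # decaying toward 0 on teacher rejections, so negative-gap tokens
    # get a softened gradient instead of a hard penalty.
    gate = torch.sigmoid(gap / tau)

    # Per-token KL(teacher || student) over the full vocabulary.
    kl = F.kl_div(student_lp, teacher_lp, log_target=True, reduction="none").sum(-1)
    return (gate * kl).mean()

# Hypothetical composition: RL stays the primary objective, distillation
# is the gated auxiliary term.
# total_loss = grpo_loss + lam * gated_distill_loss(s_logits, t_logits, sampled)
```

The design point is that the gate, not a hard mask, does the softening: a teacher rejection still contributes gradient, just a down-weighted one, which is what you want when the rejection traces back to bad skill retrieval rather than a bad student action.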
Why this matters: most published agent-RL recipes still bottleneck on GRPO instability or naive distillation, and the multi-turn instability problem has been the dirty secret slowing real-world deployment. SDAR is the cleanest published answer to that specific bottleneck I have seen.
Paper at arxiv.org/abs/2605.15155. A code link is not explicit in the abstract, but the Meituan team has a release track record. Pairs structurally with last week's RubricEM and ToolCUA as second-generation agent-training papers cleaning up first-generation RL pain.