May 23, 2026 · Research · RL

DelTA: stop letting formatting tokens hijack your RL signal

DelTA topped Hugging Face papers this week, and the idea behind it is the kind of thing that sounds like plumbing until you realize it is quietly bending how every RL-trained reasoning model learns. When you train a model with reinforcement learning from verifiable rewards, a correct answer gets a thumbs up that has to be spread back across all the tokens that produced it. The standard way of doing that, the authors show, gets dominated by high-frequency junk: formatting tokens, boilerplate, the words that show up in every response whether it was right or wrong. The sparse tokens that actually separate a good answer from a bad one get drowned out.
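To make the drowning-out concrete, here is a toy sketch, not anything from the paper. It assumes the simplest form of uniform credit: every token in a response inherits that response's advantage, taken as reward minus the group mean. The rollouts, `uniform_credit`, and `shared_mass_fraction` are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy rollouts as (tokens, verifiable reward). Not data from the paper.
ROLLOUTS = [
    ("<think> 2 + 2 = 4 </think> answer : 4".split(), 1.0),
    ("<think> 2 + 2 = 5 </think> answer : 5".split(), 0.0),
    ("<think> 2 + 2 = 4 </think> answer : 4".split(), 1.0),
    ("<think> 2 + 2 = 22 </think> answer : 22".split(), 0.0),
]


def uniform_credit(rollouts):
    """Give every token the advantage of its whole response (reward minus group mean)."""
    mean_reward = sum(reward for _, reward in rollouts) / len(rollouts)
    return [[(tok, reward - mean_reward) for tok in tokens] for tokens, reward in rollouts]


def shared_mass_fraction(rollouts):
    """Fraction of absolute credit landing on tokens seen in both winning and losing responses."""
    credit = uniform_credit(rollouts)
    signs = defaultdict(set)  # token -> set of advantage signs it was observed with
    for response in credit:
        for tok, adv in response:
            if adv != 0:
                signs[tok].add(adv > 0)
    total = sum(abs(adv) for response in credit for _, adv in response)
    shared = sum(
        abs(adv) for response in credit for tok, adv in response if len(signs[tok]) == 2
    )
    return shared / total if total else 0.0


print(shared_mass_fraction(ROLLOUTS))  # 0.8
```

In this toy group, 80% of the update mass lands on tokens that show up in right and wrong answers alike. Only the answer digits carry any information about which response was correct.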

DelTA reframes the whole thing through a discriminator lens. A policy-gradient update, it turns out, behaves like a linear discriminator deciding which tokens get pushed up and which get pushed down. So DelTA estimates per-token coefficients that amplify the patterns genuinely distinguishing high-reward from low-reward responses and downweight the shared filler. On seven math benchmarks it beats comparable baselines by about 3.26 points on Qwen3-8B-Base and 2.62 on Qwen3-14B-Base, and it holds up on code generation and out-of-domain tests too.
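A rough way to picture the discriminator idea is the simplest linear discriminator there is: a difference of class means over token counts. To be clear, this is a stand-in chosen for illustration, not DelTA's actual estimator, and the function names and toy rollouts below are hypothetical. Each token's coefficient is its average count in above-average-reward responses minus its average count in the rest, and the magnitude of that coefficient then scales the token's share of the advantage.

```python
from collections import Counter

# Hypothetical toy rollouts as (tokens, verifiable reward). Not data from the paper.
ROLLOUTS = [
    ("<think> 2 + 2 = 4 </think> answer : 4".split(), 1.0),
    ("<think> 2 + 2 = 5 </think> answer : 5".split(), 0.0),
    ("<think> 2 + 2 = 4 </think> answer : 4".split(), 1.0),
    ("<think> 2 + 2 = 22 </think> answer : 22".split(), 0.0),
]


def discriminative_coefficients(rollouts):
    """Mean-difference coefficients: avg count in high-reward minus avg count in low-reward."""
    mean_reward = sum(reward for _, reward in rollouts) / len(rollouts)
    high = [Counter(tokens) for tokens, reward in rollouts if reward > mean_reward]
    low = [Counter(tokens) for tokens, reward in rollouts if reward <= mean_reward]
    if not high or not low:
        return {}  # no contrast to learn from when every response landed in one class
    vocab = set().union(*high, *low)
    return {
        tok: sum(c[tok] for c in high) / len(high) - sum(c[tok] for c in low) / len(low)
        for tok in vocab
    }


def reweighted_credit(rollouts):
    """Scale each token's advantage by |coefficient| so shared filler gets zero credit."""
    mean_reward = sum(reward for _, reward in rollouts) / len(rollouts)
    coef = discriminative_coefficients(rollouts)
    return [
        [(tok, (reward - mean_reward) * abs(coef.get(tok, 0.0))) for tok in tokens]
        for tokens, reward in rollouts
    ]


coef = discriminative_coefficients(ROLLOUTS)
print(coef["<think>"], coef["4"], coef["5"])  # 0.0 2.0 -1.0
```

The formatting tokens score exactly zero here because they appear equally often in both classes, so all of the credit moves to the digits that separate the correct answer from the wrong ones. A real estimator would have to work on model log-probabilities and gradients, not raw counts, which this sketch does not attempt.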

The reason a fix this small got the most upvotes on Hugging Face this week is that it is not really about one number on one benchmark. It is a cleaner theory of what RLVR (reinforcement learning from verifiable rewards) is even doing to a model token by token, and a sharper signal at the base of the stack flows downhill into every agent and reasoning system trained on top of it. Three points on a benchmark is nice. A better-understood learning signal is the thing that compounds. Paper at arxiv.org/abs/2605.21467.