April 12, 2026 · Research · Open Source · Infrastructure

DMax Squeezes 6x More Tokens Per Forward Pass Out of Diffusion LLMs

Diffusion LLMs are the new hot architecture, generating multiple tokens simultaneously instead of one at a time. But they have a nasty problem: the more tokens you try to decode in parallel, the more errors accumulate, and quality collapses. DMax from NUS cracks this with a surprisingly elegant trick.

The key insight: instead of the standard binary mask-to-token transition, DMax treats decoding as progressive self-refinement. The model starts with mask embeddings and gradually polishes them into real tokens, correcting its own mistakes along the way. The training method (On-Policy Uniform Training) teaches the model to recover clean tokens from both masked inputs and its own bad predictions, so it learns to fix the errors it actually makes, not just theoretical ones.
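The refinement loop described above can be sketched in a few lines. This is a toy illustration, not DMax's actual implementation: the model stub, the `MASK` sentinel, and the confidence threshold are all hypothetical stand-ins. The point it demonstrates is the structural difference from commit-once mask decoding: every position, including ones already filled in, is re-predicted on each pass, so earlier mistakes can be overwritten.

```python
import random

MASK = -1  # hypothetical mask token id

def toy_model(seq, target):
    """Stand-in for a diffusion LLM forward pass. Returns a (token, confidence)
    prediction for every position in parallel. To mimic parallel-decoding
    errors, low-confidence predictions are deliberately wrong."""
    preds = []
    for i, _ in enumerate(seq):
        conf = random.random()
        token = target[i] if conf > 0.3 else random.randrange(100)
        preds.append((token, conf))
    return preds

def refine_decode(model, target, threshold=0.7, max_steps=20):
    """Progressive self-refinement: start all-masked, re-predict every
    position each step, keep only high-confidence tokens, and re-mask the
    rest. Previously committed tokens are also re-predicted, so the loop
    can correct its own earlier mistakes."""
    seq = [MASK] * len(target)
    for step in range(1, max_steps + 1):
        preds = model(seq, target)
        seq = [tok if conf >= threshold else MASK for tok, conf in preds]
        if MASK not in seq:  # every position settled above threshold
            return seq, step
    return seq, max_steps
```

With a real model, the average sequence length divided by the number of steps taken is exactly the "tokens per forward pass" figure the paper reports.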

Results: 6.0 tokens per forward pass on math and reasoning benchmarks, 6.6 on code, while maintaining accuracy. For context, standard autoregressive models emit exactly 1.0 token per pass. That is roughly a 6x cut in forward passes with no quality loss on the benchmarks they tested.

Code, three 16B models (math, code, general-purpose), and the training datasets have all been released on GitHub and Hugging Face. 81 upvotes on HF Daily Papers and 83 GitHub stars since the April 10 release.

https://github.com/czg1225/DMax
https://huggingface.co/papers/2604.08302
