June 10, 2026 | Open Source | Infrastructure

DiffusionGemma: Google Stops Generating One Token at a Time

Google DeepMind released DiffusionGemma on June 10, and it might be the most interesting Gemma yet because it abandons the one-token-at-a-time religion. It is a 26B mixture-of-experts open model (only 3.8B active) that generates text the way image models paint: start from noise, refine a whole 256-token block in parallel, repeat. The result is up to 4x faster than autoregressive decoding. On a single H100 it pushes past 1,000 tokens per second; a consumer RTX 5090 does 700+, and the quantized model fits in 18GB of VRAM. It comes with 256K context, multimodal input, 140+ languages, and an Apache 2.0 license. vLLM shipped support on day one, making this the first diffusion LLM it supports natively.
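To make "refine a whole block in parallel, repeat" concrete, here is a minimal toy sketch. It is not DiffusionGemma's actual algorithm: `toy_denoiser` is a hypothetical stand-in that simply reads the answer from `target`, the confidence scores are random, and the step count of 8 is an arbitrary assumption. The only thing it shows is the shape of the loop: every pass proposes tokens for all positions at once, and a share of them gets committed each round.

```python
import random

MASK = None  # placeholder for a position that has not been decided yet


def toy_denoiser(block, target, rng):
    """Hypothetical stand-in for one parallel forward pass.

    Returns a (token, confidence) proposal for every position at once.
    A real model would predict from context; this toy just reads `target`
    and attaches a random confidence.
    """
    return [(target[i], rng.random()) for i in range(len(block))]


def diffusion_decode(target, steps, seed=0):
    """Fill a fully masked block in `steps` parallel refinement passes."""
    rng = random.Random(seed)
    n = len(target)
    block = [MASK] * n
    passes = 0
    for step in range(steps):
        proposals = toy_denoiser(block, target, rng)
        passes += 1
        masked = [i for i in range(n) if block[i] is MASK]
        # Commit an even share of the remaining positions, most confident first.
        quota = -(-len(masked) // (steps - step))  # ceiling division
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:quota]:
            block[i] = proposals[i][0]
    return block, passes


def autoregressive_decode(target):
    """Baseline: one forward pass per token, strictly left to right."""
    out, passes = [], 0
    for tok in target:
        out.append(tok)
        passes += 1
    return out, passes


if __name__ == "__main__":
    target = list(range(256))  # one 256-token block
    _, diffusion_passes = diffusion_decode(target, steps=8)
    _, ar_passes = autoregressive_decode(target)
    print(diffusion_passes, ar_passes)
```

The pass counts are not a speed prediction. A pass over a whole block costs more than a single-token step, and the real number of refinement steps is not stated here, which is presumably why the headline figure is "up to 4x" rather than the ratio of pass counts.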

The honest caveat, straight from Google: output quality is below standard Gemma 4, and for maximum-quality production work they still point you to the autoregressive line. So this is an experiment shipped as a real model, not a replacement.

Why it matters for agents: agents burn most of their wall-clock time waiting on sequential decoding. Xiaomi's MiMo UltraSpeed hit 1,000+ tokens per second last week by throwing a tuned 8-GPU node at the problem; DiffusionGemma gets there on one GPU by changing the generation paradigm itself. And bidirectional attention means the model can revise earlier tokens mid-generation. That is real error correction during decoding, which autoregressive models structurally cannot do. If the quality gap closes over a generation or two, token-by-token decoding is going to look like dial-up.
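A toy illustration of what "revise earlier tokens" can mean, under loud assumptions: the "language" here is just consecutive integers, `confidence` and `refill` are hypothetical helpers, and re-masking below a threshold is one plausible revision strategy, not a description of how DiffusionGemma does it. The point is only that a scorer which sees both neighbors can flag and repair a token that later context contradicts.

```python
MASK = None  # placeholder for a position that has been re-opened for revision


def confidence(block, i):
    """Toy bidirectional score: share of neighbors consistent with block[i].

    Assumes valid text is a run of consecutive integers (illustration only).
    """
    checks = []
    if i > 0:
        checks.append(block[i - 1] is not MASK and block[i - 1] + 1 == block[i])
    if i < len(block) - 1:
        checks.append(block[i + 1] is not MASK and block[i + 1] - 1 == block[i])
    return sum(checks) / len(checks) if checks else 1.0


def refill(block, i):
    """Predict a masked position from whichever side has a committed token."""
    if i > 0 and block[i - 1] is not MASK:
        return block[i - 1] + 1
    if i < len(block) - 1 and block[i + 1] is not MASK:
        return block[i + 1] - 1
    raise ValueError("no committed neighbor to predict from")


def revise(block, threshold=0.5):
    """One revision pass: re-mask low-confidence tokens, then refill them."""
    block = list(block)
    scores = [confidence(block, i) for i in range(len(block))]
    weak = [i for i, score in enumerate(scores) if score < threshold]
    for i in weak:
        block[i] = MASK
    for i in weak:
        block[i] = refill(block, i)
    return block, weak


if __name__ == "__main__":
    # The first token is wrong, and only the tokens after it reveal that.
    print(revise([99, 11, 12]))
```

In the second case the bad token sits at position 0, so the only evidence against it comes from tokens to its right. A strict left-to-right decoder has already committed that token by the time the evidence exists.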

https://developers.googleblog.com/en/diffusiongemma-the-developer-guide/