May 8, 2026 · Infrastructure · Open Source · Research

dflash Brings Block Diffusion to Speculative Decoding

z-lab open-sourced dflash this week: 388 stars/day on GitHub Trending, 3,800 total. The contribution is a lightweight block diffusion model that serves as the draft model in speculative decoding, proposing 15-16 tokens per block in parallel instead of one at a time.
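A minimal sketch of what block drafting can look like, assuming a MaskGIT-style confidence-ordered unmasking schedule; the `denoiser` interface and the `MASK` sentinel are stand-ins for illustration, not dflash's actual API. One call proposes tokens for every masked position at once, and with `steps=1` the whole block commits in a single pass.

```python
MASK = -1  # sentinel for an undrafted position (illustrative)

def draft_block(denoiser, prefix: list[int], block_size: int = 16,
                steps: int = 4) -> list[int]:
    """Draft `block_size` tokens in at most `steps` parallel passes."""
    block = [MASK] * block_size
    for step in range(steps):
        # One forward pass proposes (position, token, confidence) for
        # every still-masked slot simultaneously -- the parallelism that
        # autoregressive drafters lack.
        proposals = [p for p in denoiser(prefix, block)
                     if block[p[0]] == MASK]
        if not proposals:
            break
        proposals.sort(key=lambda p: p[2], reverse=True)
        # Spread unmasking across the remaining steps; the last step
        # commits everything still masked.
        keep = max(1, len(proposals) // (steps - step))
        for pos, token, _ in proposals[:keep]:
            block[pos] = token
    return block
```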

Backends supported: vLLM, SGLang, Transformers, and MLX, with 15+ target models including Qwen, Gemma-4, and Llama. The bigger news is native integration into vLLM v0.20.1+: that's the production line for serving open-weight LLMs, and dflash now ships there as a first-class draft architecture.
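If the vLLM integration follows the usual pattern, usage should look roughly like this. Hedged sketch: `speculative_config` is vLLM's documented interface for speculative decoding, but the "dflash" method string and the draft checkpoint name below are assumptions, not verified against v0.20.1.

```python
from vllm import LLM, SamplingParams

# Hypothetical names: the "dflash" method string and the draft checkpoint
# are assumptions; only the speculative_config shape follows vLLM's
# documented interface.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    speculative_config={
        "method": "dflash",                # assumed registry name
        "model": "z-lab/dflash-qwen-7b",   # hypothetical draft checkpoint
        "num_speculative_tokens": 16,      # one diffusion block
    },
)
outputs = llm.generate(
    ["Explain speculative decoding in one paragraph."],
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```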

Speculative decoding has been the dominant inference acceleration story for two years, and the draft model is always the constraint: too small and the acceptance rate tanks; too big and drafting eats the latency win. Block diffusion targets a different operating point: a diffusion drafter fills a whole block in parallel from a single forward pass, sidestepping the depth-vs-quality tradeoff that autoregressive draft models hit.
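For the verify side, here's the standard greedy-acceptance loop as a sketch, not dflash's actual code; `target(tokens)` is a hypothetical one-pass interface returning the argmax next-token prediction at every position. Block diffusion only changes how the draft block is produced; verification is unchanged.

```python
def verify(target, prefix: list[int], block: list[int]) -> list[int]:
    """Accept the longest prefix of `block` the target agrees with."""
    # One target forward pass over prefix + draft block. preds[j] is the
    # target's greedy prediction for position j+1 given tokens[0..j].
    preds = target(prefix + block)
    accepted: list[int] = []
    for i, tok in enumerate(block):
        if preds[len(prefix) - 1 + i] != tok:
            break  # first disagreement: stop accepting drafts
        accepted.append(tok)
    # The target's own prediction at the break point is always usable,
    # so every verify pass yields at least one new token.
    accepted.append(preds[len(prefix) - 1 + len(accepted)])
    return accepted
```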

For agent workloads this matters more than for chat. Agents do many short generations across tool calls, so per-token inference cost, amortized over those calls, dominates the agent cloud bill. Faster speculative decoding directly cuts agent cost-per-task.
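Back-of-envelope, with made-up numbers to make that concrete:

```python
# Assumed workload shape: 40 tool calls per task, ~120 generated tokens
# each, ~10 of every 16 drafted tokens accepted, plus 1 bonus token per
# verify pass. All three numbers are illustrative, not measured.
calls, tokens_per_call = 40, 120
tokens_per_verify = 10 + 1
total = calls * tokens_per_call
passes = total / tokens_per_verify
print(f"{total} tokens in ~{passes:.0f} target passes vs {total} "
      f"autoregressive passes")  # roughly 11x fewer big-model passes
```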

The structural read: inference acceleration for agent workloads is its own subdiscipline now. Eagle, Medusa, and now block diffusion are converging on the same insight: agentic generation is structurally different from chat generation, and the optimal draft architecture is different too. Watch for Anthropic and OpenAI to publish or acquire similar work in the next 90 days.

Source: https://github.com/z-lab/dflash