DeepSeek's DSpark Makes Its Models Run 60-85% Faster Without Touching the Weights
DeepSeek shipped DSpark on Friday and it topped Hacker News by a mile. The pitch in one line: take the DeepSeek-V4 models you already have and make them generate 60 to 85 percent faster per user, without retraining anything. It's not a new model. It's an engineering bolt-on, an extra module you attach to the existing checkpoints, and the gains are real enough that it already beats Eagle-3 and DeepSeek's own DFlash.
Here's how it works, said plainly. Normally a model writes one token per forward pass, and each token has to wait for the one before it, which leaves the GPU underused for most of every step. Speculative decoding fixes this by having a cheap drafter guess several tokens ahead, then letting the full model check all of them in a single pass and keep the run of guesses it agrees with. DSpark's twist is a semi-parallel, semi-autoregressive method with confidence-scheduled validation, which is a fancy way of saying it decides how aggressively to guess based on how sure it is, and stops the GPU from sitting idle. DeepSeek reports throughput jumping anywhere from 51 percent to 400 percent depending on the setup.
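To make the guess-then-verify loop concrete, here is a minimal toy sketch of greedy speculative decoding with a confidence cutoff on the drafter. It is not DSpark's algorithm or code: the two "models" are hypothetical stand-in functions, and `max_draft` and `min_conf` are names invented for illustration. The one property it does show faithfully is that the output always matches what the full model would have written on its own, just in fewer verification rounds.

```python
from typing import List, Tuple

Token = int


def target_model(ctx: List[Token]) -> Tuple[Token, float]:
    """Toy stand-in for the full model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10, 1.0


def draft_model(ctx: List[Token]) -> Tuple[Token, float]:
    """Toy stand-in for the drafter: agrees with the target except
    right after a 4, where it guesses wrong and reports low confidence."""
    last = ctx[-1]
    if last == 4:
        return 0, 0.3
    return (last + 1) % 10, 0.9


def speculative_step(ctx: List[Token], max_draft: int = 4,
                     min_conf: float = 0.5) -> List[Token]:
    """Run one draft-then-verify round and return the tokens to append."""
    # Draft phase: keep guessing until the budget runs out or the
    # drafter's confidence falls below the cutoff (hypothetical rule).
    draft: List[Token] = []
    while len(draft) < max_draft:
        tok, conf = draft_model(ctx + draft)
        if conf < min_conf:
            break
        draft.append(tok)

    # Verify phase: a real system scores every drafted position in one
    # batched forward pass; the loop here only simulates that check.
    accepted: List[Token] = []
    for tok in draft:
        expected, _ = target_model(ctx + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first wrong guess, stop
            return accepted
        accepted.append(tok)

    # Every guess matched, so the verification also yields one extra token.
    bonus, _ = target_model(ctx + accepted)
    accepted.append(bonus)
    return accepted


def generate(prompt: List[Token], n_tokens: int,
             **kwargs) -> Tuple[List[Token], int]:
    """Generate n_tokens and count how many verification rounds it took."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(out, **kwargs))
        steps += 1
    return out[:len(prompt) + n_tokens], steps


if __name__ == "__main__":
    tokens, steps = generate([0], 12)
    print(tokens, "in", steps, "verification rounds")
```

In this toy setup, twelve tokens arrive in three verification rounds instead of twelve. Real acceptance rates depend on how often the drafter agrees with the full model, which is the part a confidence schedule tries to exploit.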
The part that matters beyond DeepSeek's own stack: it's open-sourced on GitHub and Hugging Face, and DeepSeek has already tested it on Gemma and Qwen. So this isn't a proprietary trick locked to V4; it's a general acceleration method other open models can adopt. That means cheaper, faster inference for everyone running open weights.
Why an inference paper belongs in an agents feed: speed is the silent tax on every agent. An agent that makes fifty model calls to finish one task feels the latency fifty times over, and you pay for every token of it. Halve the time and double the throughput and suddenly the long-horizon, many-step agents that were too slow or too expensive to run start to pencil out. The frontier isn't only getting smarter. It's getting cheaper to think, and that's what actually puts agents into production.
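A quick back-of-the-envelope version of that latency tax, as a sketch. The fifty calls come from the example above and the 1.6x and 1.85x factors from the headline "60 to 85 percent faster" figure. The 400 output tokens per call and the 50 tokens per second baseline are made-up illustrative numbers, and the calculation ignores prompt processing, tool time, and network overhead.

```python
def task_seconds(calls: int, tokens_per_call: int,
                 tokens_per_second: float, speedup: float = 1.0) -> float:
    """Sequential generation time for one agent task, decode time only."""
    return calls * tokens_per_call / (tokens_per_second * speedup)


if __name__ == "__main__":
    # Hypothetical workload: 50 calls, 400 tokens each, 50 tokens/s baseline.
    for speedup in (1.0, 1.6, 1.85):
        seconds = task_seconds(50, 400, 50.0, speedup)
        print(f"{speedup:.2f}x -> {seconds:.0f} s per task")
```

Under those assumptions the task drops from 400 seconds to 250 seconds at the low end of the claimed range, and the saving scales with the number of calls the agent makes.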
Link: https://github.com/deepseek-ai