antirez wrote a Metal-only DeepSeek inference engine in C, and yes that's the Redis guy
Salvatore Sanfilippo, the guy who wrote Redis, spent his evenings building a small Metal-only inference engine for DeepSeek V4 Flash and pushed it to https://github.com/antirez/ds4. HN front page right now, 223 points in 6 hours. 460 stars and climbing.
What he ships is deliberately narrow — not a generic GGUF runner, not another llama.cpp fork pretending to support everything. Only DeepSeek V4 Flash, only on Apple Metal, written in C with Objective-C and Metal kernels. The asymmetric 2-bit quantization is the wedge: only the routed MoE experts get quantized to 2-bit, everything else (shared experts, projections, routing) stays full precision. That preserves quality on the parts where you can't afford to lose information, and shrinks the weights enough that you can run a 1M token context window on a 128GB MacBook Pro.
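To make the "asymmetric" part concrete, here is a minimal sketch of 2-bit block quantization of the kind you would apply only to routed-expert weight tensors, leaving everything else in full precision. The block size, struct layout, and function names are illustrative assumptions, not the actual ds4 code:

```c
/*
 * Sketch: asymmetric 2-bit block quantization for routed-expert weights.
 * Each block stores a per-block minimum and scale (hence "asymmetric"),
 * plus 2 bits per weight. Block size and layout are assumptions.
 */
#include <stdio.h>
#include <stdint.h>
#include <math.h>

#define BLOCK 32                 /* weights per quantization block (assumed) */

typedef struct {
    float scale;                 /* step between the 4 representable levels   */
    float min;                   /* block minimum, not a symmetric zero point */
    uint8_t bits[BLOCK / 4];     /* 2 bits per weight, 4 weights per byte     */
} Q2Block;

/* Quantize one block of floats to 2-bit codes 0..3. */
static void q2_quantize(const float *w, Q2Block *b) {
    float lo = w[0], hi = w[0];
    for (int i = 1; i < BLOCK; i++) {
        if (w[i] < lo) lo = w[i];
        if (w[i] > hi) hi = w[i];
    }
    b->min = lo;
    b->scale = (hi - lo) / 3.0f;            /* 4 levels -> 3 steps */
    if (b->scale == 0.0f) b->scale = 1.0f;  /* constant block      */
    for (int i = 0; i < BLOCK / 4; i++) b->bits[i] = 0;
    for (int i = 0; i < BLOCK; i++) {
        int q = (int)roundf((w[i] - lo) / b->scale);
        if (q < 0) q = 0;
        if (q > 3) q = 3;
        b->bits[i / 4] |= (uint8_t)(q << ((i % 4) * 2));
    }
}

/* Dequantize back to floats for the matmul. */
static void q2_dequantize(const Q2Block *b, float *w) {
    for (int i = 0; i < BLOCK; i++) {
        int q = (b->bits[i / 4] >> ((i % 4) * 2)) & 3;
        w[i] = b->min + b->scale * (float)q;
    }
}

int main(void) {
    float expert_w[BLOCK], restored[BLOCK];
    for (int i = 0; i < BLOCK; i++) expert_w[i] = sinf((float)i);  /* fake weights */

    Q2Block b;
    q2_quantize(expert_w, &b);     /* routed-expert tensors get this treatment     */
    q2_dequantize(&b, restored);   /* shared experts, projections, routing skip it */

    for (int i = 0; i < 4; i++)
        printf("w=%+.3f  ~  %+.3f\n", expert_w[i], restored[i]);
    return 0;
}
```

Each 32-weight block costs 8 bytes of codes plus 8 bytes of metadata, which is where the memory headroom for a 1M-token context comes from; the routing and shared paths never pay the precision cost.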
The compressed KV cache with disk persistence is the other clever bit: a long conversation survives a laptop reboot. You don't reprocess your million-token context, you just reload it. Tool-calling and coding-agent flows are explicitly the targets, not chat. Sanfilippo claims the model produces shorter thinking sections than competitors, sometimes a fifth the length, with reasoning length proportional to actual problem complexity rather than inflated by RL pressure.
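What KV-cache persistence looks like in spirit, sketched under assumed structures (a flat buffer plus a small header; the field names and file layout here are illustrative, not the ds4 on-disk format):

```c
/*
 * Sketch: persist a (compressed) KV cache to disk so a long prefix can be
 * reloaded instead of re-prefilled after a reboot. Layout is an assumption.
 */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint32_t magic;        /* file identifier                    */
    uint32_t n_tokens;     /* tokens already processed           */
    uint64_t nbytes;       /* size of the cached key/value bytes */
} KVHeader;

#define KV_MAGIC 0x4B564331u   /* "KVC1" */

/* Persist the cache; returns 0 on success. */
static int kv_save(const char *path, const void *kv, uint32_t n_tokens, uint64_t nbytes) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    KVHeader h = { KV_MAGIC, n_tokens, nbytes };
    int ok = fwrite(&h, sizeof h, 1, f) == 1 && fwrite(kv, 1, nbytes, f) == nbytes;
    fclose(f);
    return ok ? 0 : -1;
}

/* Reload; caller frees *kv. Returns the restored token count, 0 on failure. */
static uint32_t kv_load(const char *path, void **kv, uint64_t *nbytes) {
    FILE *f = fopen(path, "rb");
    if (!f) return 0;
    KVHeader h;
    if (fread(&h, sizeof h, 1, f) != 1 || h.magic != KV_MAGIC) { fclose(f); return 0; }
    *kv = malloc(h.nbytes);
    if (!*kv || fread(*kv, 1, h.nbytes, f) != h.nbytes) { free(*kv); fclose(f); return 0; }
    *nbytes = h.nbytes;
    fclose(f);
    return h.n_tokens;
}

int main(void) {
    float cache[16] = {0};                        /* stand-in for the real KV buffer */
    kv_save("session.kv", cache, 4, sizeof cache);

    void *kv = NULL; uint64_t n = 0;
    uint32_t tokens = kv_load("session.kv", &kv, &n);
    printf("restored %u tokens, %llu bytes\n", tokens, (unsigned long long)n);
    free(kv);
    return 0;
}
```

The point of the design is that decode can resume from the restored token count: for agent workloads that keep re-entering the same huge context, skipping prefill is the entire win.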
The structural read: the local-inference layer was supposed to be sewn up by llama.cpp, vLLM, MLX, and Ollama. Sanfilippo just demonstrated that for one specific model on one specific hardware target, a 2,500-line C codebase from a senior engineer who hates abstractions outperforms the general frameworks by enough to be worth the narrowness. Expect more of these — DeepSeek V4 Flash on RTX 5090, Kimi K2.6 on M-series Mac Studio, Qwen 3.6-Max on Apple Silicon Pro — purpose-built engines for popular open weights.
Repo: https://github.com/antirez/ds4