Ollama Goes MLX — Local Agents Just Got Twice as Fast
Ollama just shipped its most significant infrastructure change for local AI on the Mac. Starting with version 0.19, Ollama on Apple Silicon can run on top of Apple's MLX framework instead of llama.cpp, currently as a preview. The result: 57% faster prefill (1,154 to 1,810 tokens/sec) and 93% faster decode (58 to 112 tokens/sec). On M5 chips with GPU Neural Accelerators it gets even better: 1,851 tokens/sec prefill and 134 tokens/sec decode with int4 quantization.
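Those percentages follow directly from the reported throughput figures; a quick sanity check using only the numbers above:

```python
# Verify the reported speedups from the announcement's tokens/sec figures.
prefill_old, prefill_new = 1154, 1810
decode_old, decode_new = 58, 112

prefill_gain = (prefill_new / prefill_old - 1) * 100  # relative improvement in %
decode_gain = (decode_new / decode_old - 1) * 100

print(f"prefill: +{prefill_gain:.0f}%, decode: +{decode_gain:.0f}%")
# → prefill: +57%, decode: +93%
```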
Why does this matter for agents? Because local inference speed is the bottleneck for every agent that needs to run without a cloud dependency. When your coding agent makes 50 tool calls in a session and waits on a response for each one, doubling the decode speed cuts your total wait time nearly in half. MLX is built for Apple Silicon's unified memory, so model weights and computation share one memory pool: no copying data between CPU and GPU, no wasted bandwidth.
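A rough model of that session makes the claim concrete. The 400-tokens-per-turn figure below is a hypothetical workload assumption, not from the announcement; only the 58 and 112 tokens/sec decode rates come from the post:

```python
# Back-of-envelope agent-session latency, assuming decode time dominates
# and each tool-call turn generates ~400 tokens (hypothetical workload).
turns = 50
tokens_per_turn = 400  # assumption, not from the announcement

old_secs = turns * tokens_per_turn / 58    # llama.cpp decode rate
new_secs = turns * tokens_per_turn / 112   # MLX decode rate

print(f"before: {old_secs:.0f}s, after: {new_secs:.0f}s, "
      f"saved: {old_secs - new_secs:.0f}s")
# → before: 345s, after: 179s, saved: 166s
```

Under these assumptions, nearly three minutes of pure waiting disappear from a single session, which compounds quickly across a working day.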
The technical details are worth noting. Ollama now supports NVIDIA's NVFP4 format, which preserves model accuracy at lower memory usage, plus improved cross-conversation cache reuse so repeated prompt patterns don't get recomputed. The preview requires 32GB+ of unified memory and ships with Qwen3.5-35B-A3B as the showcase model. MLX contributions came from Apple engineers, NVIDIA, the GGML/llama.cpp community, and Alibaba's Qwen team, a rare cross-company collaboration.
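The 32GB requirement makes sense once you estimate weight memory at different precisions. This is a weights-only back-of-envelope sketch (KV cache, activations, and runtime overhead add more on top), using the 35B parameter count from the showcase model:

```python
# Rough weights-only memory footprint for a 35B-parameter model
# at common precisions. Real usage is higher: KV cache, activations,
# and runtime overhead are not counted here.
params = 35e9
bytes_per_param = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

for name, nbytes in bytes_per_param.items():
    gib = params * nbytes / 2**30
    print(f"{name}: {gib:.1f} GiB")
# → fp16: 65.2 GiB
# → int8: 32.6 GiB
# → int4: 16.3 GiB
```

At int4, the weights fit comfortably on a 32GB machine with room left for the cache and the rest of the system, while fp16 would not fit at all.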
This is the kind of change that doesn't make headlines but reshapes the landscape. Every Mac with 32GB+ RAM is now a significantly more capable agent execution platform. The gap between cloud inference and local inference just narrowed dramatically.
https://ollama.com/blog/mlx