Ollama Goes MLX — Local Agents Just Got Twice as Fast
Ollama just shipped its most significant infrastructure change for local AI on the Mac. Starting with version 0.19, Ollama on Apple Silicon can run on top of Apple's MLX framework instead of llama.cpp, currently as a preview. The result: 57% faster prefill (1,154 to 1,810 tokens/sec) and 93% faster decode (58 to 112 tokens/sec). On M5 chips with GPU Neural Accelerators it gets even better: 1,851 tokens/sec prefill and 134 tokens/sec decode with int4 quantization.
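Those percentages follow directly from the reported throughput figures; a quick sanity check using only the numbers above:

```python
# Verify the reported speedups from the announcement's tokens/sec figures.
prefill_old, prefill_new = 1154, 1810
decode_old, decode_new = 58, 112

prefill_gain = (prefill_new / prefill_old - 1) * 100  # relative improvement in %
decode_gain = (decode_new / decode_old - 1) * 100

print(f"prefill: +{prefill_gain:.0f}%, decode: +{decode_gain:.0f}%")
# → prefill: +57%, decode: +93%
```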
Why does this matter for agents? Because local inference speed is the bottleneck for every agent that needs to run without a cloud dependency. When your coding agent makes 50 tool calls in a session and waits on a response for each one, doubling the decode speed cuts your total wait time nearly in half. MLX is built for Apple Silicon's unified memory, so model weights and computation share one memory pool: no copying data between CPU and GPU, no wasted bandwidth.
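A rough model of that session makes the claim concrete. The 400-tokens-per-turn figure below is a hypothetical workload assumption, not from the announcement; only the 58 and 112 tokens/sec decode rates come from the post:

```python
# Back-of-envelope agent-session latency, assuming decode time dominates
# and each tool-call turn generates ~400 tokens (hypothetical workload).
turns = 50
tokens_per_turn = 400  # assumption, not from the announcement

old_secs = turns * tokens_per_turn / 58    # llama.cpp decode rate
new_secs = turns * tokens_per_turn / 112   # MLX decode rate

print(f"before: {old_secs:.0f}s, after: {new_secs:.0f}s, "
      f"saved: {old_secs - new_secs:.0f}s")
# → before: 345s, after: 179s, saved: 166s
```

Under these assumptions, nearly three minutes of pure waiting disappear from a single session, which compounds quickly across a working day.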
The technical details are worth noting. Ollama now supports NVIDIA's NVFP4 format, which preserves model accuracy at lower memory usage, plus improved cross-conversation cache reuse so repeated prompt patterns don't get recomputed. The preview requires 32GB+ of unified memory and ships with Qwen3.5-35B-A3B as the showcase model. MLX contributions came from Apple engineers, NVIDIA, the GGML/llama.cpp community, and Alibaba's Qwen team, a rare cross-company collaboration.
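The 32GB requirement makes sense once you estimate weight memory at different precisions. This is a weights-only back-of-envelope sketch (KV cache, activations, and runtime overhead add more on top), using the 35B parameter count from the showcase model:

```python
# Rough weights-only memory footprint for a 35B-parameter model
# at common precisions. Real usage is higher: KV cache, activations,
# and runtime overhead are not counted here.
params = 35e9
bytes_per_param = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

for name, nbytes in bytes_per_param.items():
    gib = params * nbytes / 2**30
    print(f"{name}: {gib:.1f} GiB")
# → fp16: 65.2 GiB
# → int8: 32.6 GiB
# → int4: 16.3 GiB
```

At int4, the weights fit comfortably on a 32GB machine with room left for the cache and the rest of the system, while fp16 would not fit at all.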
This is the kind of change that doesn't make headlines but reshapes the landscape. Every Mac with 32GB+ RAM is now a significantly more capable agent execution platform. The gap between cloud inference and local inference just narrowed dramatically.
https://ollama.com/blog/mlx