Google Shrinks Gemma 4 to Run on Your Phone Without Wrecking It
Google DeepMind dropped quantization-aware training checkpoints for Gemma 4 yesterday, and it hit the Hacker News front page hard. The short version: they made the whole Gemma 4 lineup small enough to run locally on phones and laptops without the usual quality faceplant that comes with squeezing a model down.
The trick is in the name. Most teams compress a model after training (post-training quantization), and the model loses accuracy because it was never built to live in 4-bit. QAT instead simulates quantization during training, so the model learns to be good while already cramped. The payoff is concrete: roughly 72% lower memory in 4-bit with near-original performance, and a new mobile-specialized format that gets the Gemma 4 E2B footprint down to a single gigabyte. The release spans five sizes, including E2B and E4B aimed at phones and the bigger 26B-A4B and 31B, which now run on laptops instead of demanding a beefy home GPU.
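To make "simulating the quantization" concrete, here is a minimal sketch in plain Python of the round trip that QAT typically inserts into the forward pass: weights are snapped to a small integer grid and mapped back to floats, so the rest of the network sees the rounding error while it trains. This is a generic illustration, not Google's recipe. The function names, the symmetric per-tensor scale, the group size of 32, and the 16-bit scales are all assumptions chosen for the example; the article does not describe the actual format.

```python
import math


def fake_quantize(weights, bits=4):
    """Snap floats to a symmetric signed integer grid and map them back.

    Hypothetical helper: one scale for the whole list (per-tensor scaling).
    """
    qmax = 2 ** (bits - 1) - 1       # 7 for 4-bit
    qmin = -(2 ** (bits - 1))        # -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    dequantized = []
    for w in weights:
        q = round(w / scale)             # nearest integer level
        q = max(qmin, min(qmax, q))      # clip to the representable range
        dequantized.append(q * scale)    # back to float, now carrying rounding error
    return dequantized, scale


def weight_memory_bytes(num_params, bits, group_size=None, scale_bits=16):
    """Estimate weight storage, optionally adding one scale per group of weights."""
    total_bits = num_params * bits
    if group_size is not None:
        total_bits += math.ceil(num_params / group_size) * scale_bits
    return total_bits / 8


weights = [0.82, -0.31, 0.07, -0.64, 0.0, 0.45]
approx, scale = fake_quantize(weights, bits=4)
worst_error = max(abs(w - a) for w, a in zip(weights, approx))

# Assumed layout: 16-bit baseline vs 4-bit weights with a 16-bit scale per 32 weights.
baseline = weight_memory_bytes(1_000_000, bits=16)
quantized = weight_memory_bytes(1_000_000, bits=4, group_size=32)
saving = 1 - quantized / baseline

print(f"scale={scale:.4f} worst_error={worst_error:.4f} saving={saving:.1%}")
```

In real QAT the rounding step has no useful gradient, so training frameworks usually pass gradients through it unchanged (the straight-through estimator). Under the assumed layout each weight costs 4.5 bits instead of 16, a saving of about 72%, which is one plausible way to arrive at a figure like the one quoted; the real overhead depends on the format Google actually ships.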
Put this next to General Instinct launching the same day and you see the theme of the week. The frontier is no longer only about the biggest model in the biggest cluster. There is a parallel race to push real capability down onto the device, offline, in your pocket. Google open-sourcing QAT checkpoints is the model lab playing that game directly, instead of leaving on-device optimization to third parties.
For anyone building local agents, this is the difference between a toy and a tool. A capable Gemma running offline at one gigabyte means the agent on your phone does not need to call the cloud to think.
Source: https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/