Google LiteRT-LM: Run Agentic AI on a Raspberry Pi, Seriously
Google just open-sourced the inference engine that powers Gemini Nano across Chrome and Pixel devices, and it is more significant for the agent ecosystem than the Gemma 4 models launched alongside it.
LiteRT-LM is a production-ready framework for running large language models on edge devices. Android, iOS, Web, Desktop, Raspberry Pi, even Pixel Watch. It is not a research demo. This is the same infrastructure that already powers Gemini Nano in production at Google scale. And now anyone can use it.
The numbers that matter for agent builders: 4,000 input tokens processed across 2 distinct agentic skills in under 3 seconds. On a Raspberry Pi 5, you get 7.6 decode tokens per second on CPU alone. Add a Qualcomm Dragonwing IQ8 NPU and that jumps to 31 tokens per second. With 2-bit and 4-bit quantization plus memory-mapped embeddings, some models run in under 1.5 GB of RAM. That means a Gemma 4 model with function calling and structured output can fit on a phone.
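To make the memory claim concrete, here is some back-of-envelope arithmetic. The parameter count and runtime overhead below are illustrative assumptions, not official Gemma figures; the point is just how quickly low-bit quantization shrinks the resident weight footprint.

```python
# Rough RAM estimate for a quantized on-device model.
# params_billion and overhead_gb are illustrative assumptions.

def model_ram_gb(params_billion, bits_per_weight, overhead_gb=0.2):
    """Approximate weight memory plus a fixed runtime overhead, in GB."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes / 1e9 + overhead_gb

print(round(model_ram_gb(2.0, 4), 2))  # a 2B-param model at 4-bit: ~1.2 GB
print(round(model_ram_gb(2.0, 2), 2))  # the same model at 2-bit: ~0.7 GB
```

Under these assumptions, a 2B-parameter model at 4 bits per weight lands comfortably under the 1.5 GB figure, and memory-mapped embeddings push the resident footprint lower still.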
The key features for agentic workloads are constrained decoding for structured JSON outputs, which agents need for tool calls, and dynamic context handling that splits work between CPU and GPU on the same device. There is also a new Python CLI tool so you can test Gemma 4 agentic skills on a Linux box or Mac without writing any code.
https://github.com/google-ai-edge/LiteRT-LM
The repo is trending on GitHub right now at 487 stars per day, with nearly 2,000 total. It supports cross-platform deployment, which means one codebase runs agents on phones, watches, cars, kiosks, robots, anything with a processor.
Here is why this matters more than another cloud API: edge agents do not have network latency, do not leak data to servers, and do not stop working when the internet goes down. Every device becomes a potential agent runtime. LiteRT-LM is the plumbing that makes that real, not someday, but now.