Cactus Compute Distilled Gemini Tool Calling Into a 26 Million Parameter Model. It Runs on a Watch.
Cactus Compute released Needle yesterday, from Henry Ndubuaku, Jakub Mroz, Karen Mosoyan, Roman Shemet, and team. It is a 26 million parameter model distilled from Gemini 3.1, designed for function calling on devices that cannot afford to carry a frontier LLM: phones, watches, smart glasses, embedded boards. The repo is on the HN front page with 174 points at 3 hours old.
The pitch is plain. Function calling is the single most important agent primitive that does not need a 70 billion parameter model behind it. The task: given a natural language query, output a JSON tool call. Cactus trained a 26M Simple Attention Network on 200 billion pretraining tokens in 27 hours on 16 TPU v6e chips, then post-trained it on 2 billion tokens of distilled Gemini tool-calling traces in 45 minutes. The result outperforms FunctionGemma-270m, Qwen-0.6B, and other 350M-parameter baselines on single-shot function calling.
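To make the task concrete, here is a minimal Python sketch of what the host application does with the model's output: parse the JSON tool call, check it against a registered tool, and dispatch. The `set_timer` tool, the `{"name": ..., "arguments": ...}` shape, and the hard-coded model output are assumptions for illustration, not Needle's documented output format.

```python
import json

# Hypothetical tool used only for this example.
def set_timer(minutes: int, label: str = "") -> str:
    return f"timer set for {minutes} min" + (f" ({label})" if label else "")

# Hypothetical registry: tool name -> argument types and handler.
TOOLS = {
    "set_timer": {
        "required": {"minutes": int},
        "optional": {"label": str},
        "handler": set_timer,
    },
}

def dispatch(raw_output: str) -> str:
    """Parse a JSON tool call, validate it, and run the matching handler."""
    call = json.loads(raw_output)  # raises json.JSONDecodeError on malformed output
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    name = call.get("name")
    args = call.get("arguments", {})
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    spec = TOOLS[name]
    for key in spec["required"]:
        if key not in args:
            raise ValueError(f"missing argument: {key}")
    allowed = {**spec["required"], **spec["optional"]}
    for key, value in args.items():
        if key not in allowed:
            raise ValueError(f"unexpected argument: {key}")
        if not isinstance(value, allowed[key]):
            raise TypeError(f"{key} should be {allowed[key].__name__}")
    return spec["handler"](**args)

# Stand-in for what a small function-calling model might emit for
# "set a ten minute timer for pasta" (assumed output, not a real trace).
model_output = '{"name": "set_timer", "arguments": {"minutes": 10, "label": "pasta"}}'
result = dispatch(model_output)
print(result)
```

The validation step matters more with a tiny model than with a frontier one: the host should treat every emitted call as untrusted input and reject anything that does not match a known tool.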
The production-side numbers are what should make every on-device agent team look twice. 6000 tokens per second prefill. 1200 tokens per second decode. Those are throughput numbers that work inside a watchOS or Wear OS frame budget, which is what has kept agentic tool use from landing on wearables until now. Run it locally and there is no API latency, no API spend, and no API privacy concern.
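A rough back-of-the-envelope check on what those rates mean per request, sketched in Python. The request size (300 prompt tokens covering the query and tool definitions, 40 output tokens) is assumed for illustration; the article does not give one, and real latency also depends on the hardware and on model load time.

```python
PREFILL_TOKENS_PER_S = 6000  # prefill rate quoted in the article
DECODE_TOKENS_PER_S = 1200   # decode rate quoted in the article

def estimate_latency_ms(prompt_tokens: int, output_tokens: int) -> float:
    """Prefill time plus decode time, ignoring load time and other overhead."""
    prefill_s = prompt_tokens / PREFILL_TOKENS_PER_S
    decode_s = output_tokens / DECODE_TOKENS_PER_S
    return 1000 * (prefill_s + decode_s)

# Assumed request size: 300 prompt tokens in, 40 tokens of JSON out.
latency_ms = estimate_latency_ms(prompt_tokens=300, output_tokens=40)
print(f"{latency_ms:.1f} ms")
```

Under those assumptions, prefill takes 50 ms and decode about 33 ms, so the whole call finishes in roughly 83 ms before any network hop would have been needed.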
The meta-pattern: distillation of frontier-model behavior into tiny task-specific models has been an academic curiosity for two years. Needle is the first credible attempt at making it the default for the on-device agent layer. If this works at production scale, every wearable platform vendor (Apple Watch, Wear OS, Meta Ray-Bans, Humane) has the same question on its roadmap: do you put a frontier model behind a cloud round-trip, or do you put a Needle-style 26M model on the device?
This pairs with the cluster of small-model agent research from the past quarter: Frontier Coding Agents AlphaZero, small skill-routers, sub-billion-parameter function callers. The thesis is shaping up: the agent stack is going bimodal. A big reasoning model in the cloud for hard tool plans, tiny specialized models on the device for high-frequency tool dispatch. Needle is the credible reference for the lower band. github.com/cactus-compute/Needle.
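One way to picture the bimodal split is a small router in front of two backends. The Python sketch below is purely illustrative: the `route` function, the call-count heuristic, and the backend labels are hypothetical, not something Cactus or the Needle repo describes.

```python
from dataclasses import dataclass

@dataclass
class Request:
    query: str
    expected_tool_calls: int  # hypothetical estimate from an upstream heuristic

def route(req: Request, on_device_max_calls: int = 1) -> str:
    """Send single-shot tool dispatch to the device, multi-step plans to the cloud."""
    if req.expected_tool_calls <= on_device_max_calls:
        return "on_device"
    return "cloud"

requests = [
    Request("set a ten minute timer", expected_tool_calls=1),
    Request("plan a trip and book everything", expected_tool_calls=6),
]
decisions = [route(r) for r in requests]
print(decisions)
```

The threshold of one call mirrors the single-shot setting the benchmarks above cover; anything that needs a multi-step plan falls through to the larger model.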