March 22, 2026 · Infrastructure · Open Source · Tool

Flash-MoE: Running a 397B Parameter Model on a MacBook with Pure C and Metal

Flash-MoE is a pure C/Metal inference engine that runs Qwen3.5-397B-A17B, a 397-billion-parameter Mixture-of-Experts model, on a MacBook Pro with just 48GB of RAM, achieving 4.4+ tokens per second with production-quality output, including tool calling. The project is trending on Hacker News with 125+ points.

The implementation uses no Python and no frameworks: just C, Objective-C, and hand-tuned Metal shaders. The entire 209GB model streams from SSD through a custom Metal compute pipeline. The model has 60 transformer layers (45 GatedDeltaNet linear-attention layers and 15 standard full-attention layers); each layer has 512 experts, of which 4 are activated per token, plus one shared expert.

Key optimizations include reducing the number of activated experts from 10 to 4 per token and requantizing expert weights to 2 bits, cutting expert storage from 209GB to 120GB. Non-expert components (the embedding table and routing matrices) remain at their original precision, adding 5.5GB of resident memory.

For the agentic ecosystem, Flash-MoE represents a significant infrastructure advancement: it proves that frontier-class MoE models can run on consumer hardware without cloud dependencies. This enables local agent deployments with models that were previously cloud-only, a critical capability for privacy-sensitive agent workflows.

GitHub: https://github.com/danveloper/flash-moe
