March 22, 2026 | Infrastructure | Open Source | Tool

Flash-MoE: Running a 397B Parameter Model on a MacBook with Pure C and Metal

Flash-MoE is a pure C/Metal inference engine that runs Qwen3.5-397B-A17B, a 397-billion-parameter Mixture-of-Experts model, on a MacBook Pro with just 48GB of RAM. It generates 4.4+ tokens per second with production-quality output, including tool calling. The project is trending on Hacker News with 125+ points.

The implementation uses no Python and no frameworks, just C, Objective-C, and hand-tuned Metal shaders. The entire 209GB model streams from SSD through a custom Metal compute pipeline. The model has 60 transformer layers (45 GatedDeltaNet linear-attention layers and 15 standard full-attention layers). Each layer has 512 experts, of which 4 are activated per token, plus one shared expert.
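The routing step described above, picking 4 of 512 experts for each token, can be sketched in plain C. This is a minimal illustration and not code from the Flash-MoE repository: the function name `select_top_k` and the constants `NUM_EXPERTS` and `TOP_K` are hypothetical, and the sketch assumes the router produces one score per expert and that the highest scores win.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_EXPERTS 512  /* experts per layer, as stated in the article */
#define TOP_K 4          /* experts activated per token, as stated in the article */

/* Hypothetical helper: pick the k highest router scores out of n.
 * On return, out_idx/out_score hold the winners in descending score order.
 * Ties keep the lower expert index first. */
static void select_top_k(const float *scores, size_t n, size_t k,
                         int *out_idx, float *out_score)
{
    assert(k > 0 && k <= n);
    size_t filled = 0;  /* number of valid entries in the output arrays */

    for (size_t i = 0; i < n; i++) {
        /* Once the list is full, skip anything that cannot beat the worst. */
        if (filled == k && scores[i] <= out_score[k - 1])
            continue;

        /* Insert into the sorted list, shifting smaller entries down.
         * This is O(n * k), which is cheap for k = 4. */
        size_t pos = (filled < k) ? filled : k - 1;
        while (pos > 0 && out_score[pos - 1] < scores[i]) {
            out_score[pos] = out_score[pos - 1];
            out_idx[pos] = out_idx[pos - 1];
            pos--;
        }
        out_score[pos] = scores[i];
        out_idx[pos] = (int)i;
        if (filled < k)
            filled++;
    }
}
```

Calling `select_top_k(scores, NUM_EXPERTS, TOP_K, idx, score)` yields the four expert indices whose weights would need to be fetched for that token. In an engine that streams weights from SSD, this is what keeps the per-token read small: only the selected experts are touched, not all 512.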

Key optimizations include reducing the number of activated experts from 10 to 4 per token and requantizing expert weights to 2 bits, which cuts expert storage from 209GB to 120GB. Non-expert components (embedding table, routing matrices) stay at their original precision and add 5.5GB of resident memory.
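To make the idea of 2-bit weights concrete, here is a small C sketch of group-wise 2-bit quantization. The article does not describe Flash-MoE's actual on-disk format, so everything here is an assumption for illustration: the group size of 64, the `Q2Group` layout, the per-group scale and minimum, and the function names are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GROUP 64  /* hypothetical group size, not taken from Flash-MoE */
static_assert(GROUP % 4 == 0, "four 2-bit codes are packed per byte");

/* Hypothetical layout: one scale/min pair per group plus packed codes. */
typedef struct {
    float scale;                /* step between adjacent quantization levels */
    float min;                  /* value represented by code 0 */
    uint8_t packed[GROUP / 4];  /* four 2-bit codes per byte */
} Q2Group;

/* Quantize GROUP floats to 2-bit codes (levels 0..3) with affine scaling. */
static void q2_quantize(const float *w, Q2Group *g)
{
    float lo = w[0], hi = w[0];
    for (size_t i = 1; i < GROUP; i++) {
        if (w[i] < lo) lo = w[i];
        if (w[i] > hi) hi = w[i];
    }
    g->min = lo;
    g->scale = (hi - lo) / 3.0f;  /* 2 bits give 4 levels, so 3 steps */

    for (size_t i = 0; i < GROUP / 4; i++)
        g->packed[i] = 0;

    for (size_t i = 0; i < GROUP; i++) {
        unsigned code = 0;
        if (g->scale > 0.0f) {
            /* Round to the nearest level and clamp to the 2-bit range. */
            code = (unsigned)((w[i] - lo) / g->scale + 0.5f);
            if (code > 3) code = 3;
        }
        g->packed[i / 4] |= (uint8_t)(code << (2 * (i % 4)));
    }
}

/* Expand the packed codes back to approximate float weights. */
static void q2_dequantize(const Q2Group *g, float *out)
{
    for (size_t i = 0; i < GROUP; i++) {
        unsigned code = (g->packed[i / 4] >> (2 * (i % 4))) & 3u;
        out[i] = g->min + g->scale * (float)code;
    }
}
```

In this sketch, 64 weights occupy 16 bytes of codes plus 8 bytes of per-group metadata. Each weight can only take one of four values within its group, which is why such aggressive quantization trades accuracy for a smaller footprint and fewer bytes read from SSD per expert.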

For the agentic ecosystem, Flash-MoE is a significant infrastructure advance: it shows that frontier-class MoE models can run on consumer hardware without cloud dependencies. This enables local agent deployments with models that were previously cloud-only, a critical capability for privacy-sensitive agent workflows.

GitHub: https://github.com/danveloper/flash-moe