March 25, 2026 · Infrastructure · Open Source · Tool

Hypura: Storage-Tier-Aware LLM Inference Scheduler for Apple Silicon

Hypura is an open-source LLM inference scheduler that enables running large language models that exceed physical memory on Apple Silicon Macs. It intelligently distributes model tensors across GPU, RAM, and NVMe storage tiers based on access patterns and bandwidth costs.
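To make the placement idea concrete, here is a minimal sketch of a greedy tier assignment. It is not Hypura's actual algorithm: the `Tensor` fields, the `place_tensors` function, the tier names, and all sizes are hypothetical, chosen only to illustrate "hottest tensors go to the fastest tier that still has room".

```python
from dataclasses import dataclass


@dataclass
class Tensor:
    name: str
    size_gb: float
    reads_per_token: float  # assumed access frequency per generated token


def place_tensors(tensors, capacity_gb):
    """Greedily assign the most frequently read tensors to the fastest tier.

    capacity_gb maps tier name -> capacity in GB and must be ordered from
    fastest to slowest (dicts preserve insertion order in Python 3.7+).
    """
    free = dict(capacity_gb)
    placement = {}
    # Hottest tensors first; among equally hot tensors, smaller ones first.
    for t in sorted(tensors, key=lambda t: (-t.reads_per_token, t.size_gb)):
        for tier in free:
            if t.size_gb <= free[tier]:
                free[tier] -= t.size_gb
                placement[t.name] = tier
                break
        else:
            raise ValueError(f"{t.name} does not fit in any tier")
    return placement


# Hypothetical model: always-used layers plus two rarely used experts.
tensors = [
    Tensor("attn", 4, 1.0),
    Tensor("expert_0", 10, 0.25),
    Tensor("expert_1", 10, 0.25),
    Tensor("embed", 1, 1.0),
]
capacity = {"gpu": 6, "ram": 12, "nvme": 100}
placement = place_tensors(tensors, capacity)
print(placement)
```

A real scheduler would also weigh per-tier bandwidth and transfer cost, as the description above suggests; this sketch ranks by access frequency alone to keep the idea visible.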

The project addresses a hard limit: a 32GB M1 Max cannot naively load a 40GB model, because the OS swap-thrashes until the OOM killer intervenes. Hypura makes these previously impossible inference scenarios usable, running Mixtral 8x7B at 2.2 tokens/second and Llama 70B at 0.3 tokens/second on hardware where llama.cpp simply crashes.
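The arithmetic behind this limit can be sketched in a few lines. The functions and every number below are illustrative assumptions, not measurements from Hypura: the sketch assumes a dense model whose spilled weights must each be read from NVMe once per token, which gives a rough upper bound on throughput.

```python
def spilled_gb(model_gb, memory_budget_gb):
    """Portion of the model that cannot stay resident in memory."""
    return max(0.0, model_gb - memory_budget_gb)


def max_tokens_per_second(spill_gb, nvme_gb_per_s):
    """Upper bound on throughput if the spilled bytes are re-read every token."""
    if spill_gb == 0:
        return float("inf")  # nothing to stream, storage is not the bottleneck
    return nvme_gb_per_s / spill_gb


# Hypothetical figures: 40 GB model, 24 GB usable for weights on a 32 GB
# machine, and 4 GB/s sustained NVMe read bandwidth.
spill = spilled_gb(40.0, 24.0)
bound = max_tokens_per_second(spill, 4.0)
print(spill, bound)
```

Under these assumed figures, storage bandwidth alone caps a dense model well below one token per second, which is why streaming only the weights that are actually needed matters so much for MoE models.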

Key features include expert-streaming mode for MoE models like Mixtral with 99.5% cache hit rate via neuron caching, dense FFN-streaming for non-MoE models like Llama 70B, an Ollama-compatible HTTP API, and zero overhead for models that fit in memory.
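The caching idea behind expert streaming can be illustrated with a small least-recently-used cache. This is a sketch under assumptions, not Hypura's implementation: the `ExpertCache` class, the `load_fn` callback (standing in for a read from NVMe), and the plain LRU eviction policy are all hypothetical.

```python
from collections import OrderedDict


class ExpertCache:
    """Keep the most recently used expert weights in memory (plain LRU)."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity      # maximum number of experts held in memory
        self.load_fn = load_fn        # hypothetical loader, e.g. a read from NVMe
        self._items = OrderedDict()   # ordered from least to most recently used
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self._items:
            self._items.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
            return self._items[expert_id]
        self.misses += 1
        weights = self.load_fn(expert_id)
        self._items[expert_id] = weights
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)     # evict the least recently used
        return weights

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Toy access pattern with room for two experts.
cache = ExpertCache(capacity=2, load_fn=lambda i: f"weights-{i}")
for expert_id in [0, 1, 0, 2, 1]:
    cache.get(expert_id)
print(cache.hits, cache.misses, cache.hit_rate)
```

The hit rate depends entirely on how skewed the expert routing is and how many experts fit in memory, so the toy pattern here says nothing about the 99.5% figure reported for Mixtral.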

Created on March 13, 2026, Hypura is trending on Hacker News with 194 points and has gained 346 stars on GitHub. It represents a meaningful step toward democratizing large model inference on consumer Apple hardware.

GitHub: https://github.com/t8/hypura