Hypura: Storage-Tier-Aware LLM Inference Scheduler for Apple Silicon
Hypura is an open-source LLM inference scheduler that enables running large language models that exceed physical memory on Apple Silicon Macs. It intelligently distributes model tensors across GPU, RAM, and NVMe storage tiers based on access patterns and bandwidth costs.
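The tier-placement idea can be sketched as a greedy assignment: put the most frequently accessed tensors in the fastest tier until it fills, then spill to the next one. This is a minimal illustrative sketch, not Hypura's actual algorithm; the tier names, bandwidths, capacities, and tensor figures below are all hypothetical.

```python
# Hypothetical sketch of storage-tier-aware placement: greedily assign the
# hottest tensors to the fastest tier that still has room.
# All numbers below are illustrative, not Hypura's real values.
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    bandwidth_gbps: float   # higher = cheaper to stream from
    capacity_gb: float
    used_gb: float = 0.0

@dataclass
class Tensor:
    name: str
    size_gb: float
    access_freq: float      # estimated accesses per generated token

def place(tensors, tiers):
    """Assign each tensor to the fastest tier with room, hottest first."""
    placement = {}
    # Hottest-first so frequently accessed tensors land in fast memory.
    for t in sorted(tensors, key=lambda x: x.access_freq, reverse=True):
        for tier in sorted(tiers, key=lambda x: x.bandwidth_gbps, reverse=True):
            if tier.used_gb + t.size_gb <= tier.capacity_gb:
                tier.used_gb += t.size_gb
                placement[t.name] = tier.name
                break
    return placement

tiers = [Tier("gpu", 400.0, 8.0), Tier("ram", 100.0, 16.0), Tier("nvme", 7.0, 64.0)]
tensors = [
    Tensor("attn.0", 4.0, 1.0),    # touched every token
    Tensor("ffn.0", 6.0, 0.8),
    Tensor("expert.3", 12.0, 0.1), # rarely routed to
]
print(place(tensors, tiers))
# → {'attn.0': 'gpu', 'ffn.0': 'ram', 'expert.3': 'nvme'}
```

A real scheduler would also weigh transfer cost against tensor size (a huge, lukewarm tensor may not be worth a fast-tier slot), but the greedy version captures the core trade-off.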
The project solves a critical limitation: a 32GB M1 Max cannot naively load a 40GB model without the OS swap-thrashing until the OOM killer intervenes. Hypura makes previously impossible inference scenarios usable, running Mixtral 8x7B at 2.2 tokens/second and Llama 70B at 0.3 tokens/second on hardware where llama.cpp simply crashes.
Key features include expert-streaming mode for MoE models like Mixtral with 99.5% cache hit rate via neuron caching, dense FFN-streaming for non-MoE models like Llama 70B, an Ollama-compatible HTTP API, and zero overhead for models that fit in memory.
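The expert-streaming idea relies on routing being skewed: if the same experts keep getting selected, keeping recently used experts resident avoids most NVMe reads. A simple way to picture this is an LRU cache over experts; note this is only an illustrative sketch, and Hypura's actual neuron caching is reportedly finer-grained than whole-expert LRU.

```python
# Hypothetical sketch of an expert cache for MoE streaming: keep recently
# used experts resident, evict the least-recently-used one when the memory
# budget is exceeded. Cache size and routing pattern are illustrative.
from collections import OrderedDict

class ExpertCache:
    def __init__(self, max_experts):
        self.max_experts = max_experts
        self._cache = OrderedDict()   # expert_id -> weights (stub)
        self.hits = 0
        self.misses = 0

    def get(self, expert_id, load_fn):
        if expert_id in self._cache:
            self._cache.move_to_end(expert_id)   # mark as recently used
            self.hits += 1
            return self._cache[expert_id]
        self.misses += 1
        weights = load_fn(expert_id)             # stream from NVMe on a miss
        self._cache[expert_id] = weights
        if len(self._cache) > self.max_experts:
            self._cache.popitem(last=False)      # evict the LRU expert
        return weights

cache = ExpertCache(max_experts=2)
load = lambda eid: f"weights-{eid}"              # stand-in for a disk read
for eid in [0, 1, 0, 0, 2, 0]:                   # skewed routing favors expert 0
    cache.get(eid, load)
print(cache.hits, cache.misses)
# → 3 3
```

With real routing traces the skew is far stronger than this toy sequence, which is how hit rates like the reported 99.5% become plausible: the working set of hot experts fits in memory even when the full model does not.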
Created on March 13, 2026, Hypura is trending on Hacker News with 194 points and has gained 346 stars on GitHub. It represents a meaningful step toward democratizing large model inference on consumer Apple hardware.
GitHub: https://github.com/t8/hypura