Mesh LLM: Turn Your Spare GPUs Into One Big Inference Cloud
Running large open models locally has always been a "one big machine" problem. Mesh LLM flips it: pool spare GPU capacity across multiple machines and expose the result as a single OpenAI-compatible API endpoint. No config, no manual sharding, no custom client.
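Because the mesh speaks the standard OpenAI chat-completions protocol, any existing client code should work unchanged. A minimal sketch, using only the Python standard library; the host, port, and path here are placeholders I'm assuming, not documented Mesh LLM values:

```python
import json
import urllib.request

# Placeholder endpoint: Mesh LLM exposes an OpenAI-compatible API, so a
# standard chat-completions payload needs no mesh-specific fields. The
# address below is an assumption for illustration.
MESH_ENDPOINT = "http://localhost:8080/v1/chat/completions"

def build_chat_request(model: str, prompt: str) -> dict:
    """Standard OpenAI-style chat payload; nothing mesh-specific."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def send(payload: dict) -> dict:
    """POST the payload to the mesh; requires a running Mesh LLM node."""
    req = urllib.request.Request(
        MESH_ENDPOINT,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("qwen3", "Summarize pipeline parallelism.")
```

The point is that no custom client is needed: the same payload you would send to any OpenAI-compatible server is routed by the mesh to whichever nodes hold the model.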
The clever part is how it handles distribution. If a model fits on one machine, it runs there. If not, Mesh LLM automatically splits the work: dense models get pipeline parallelism, MoE models get expert sharding with zero cross-node inference traffic. That last part matters: for models like Qwen3, GLM, Mixtral, and DeepSeek, each node gets the full trunk plus an overlapping expert shard. Critical experts get replicated everywhere, the rest distributed uniquely. The result is that each node runs its own llama-server independently during inference.
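The placement scheme described above can be sketched in a few lines. This is an illustrative toy, not the project's actual code: the function name, the round-robin distribution, and how "critical" experts are chosen are all assumptions; only the shape of the result (hot experts everywhere, the rest split uniquely) comes from the article:

```python
def place_experts(experts, critical, nodes):
    """Assign each expert to one or more nodes.

    Critical (frequently activated) experts are replicated on every
    node, so a request served by one node never needs another node's
    experts at inference time. The remaining experts are distributed
    uniquely, round-robin, across nodes.
    """
    placement = {n: set(critical) for n in nodes}  # replicate hot experts
    rest = [e for e in experts if e not in critical]
    for i, expert in enumerate(rest):
        placement[nodes[i % len(nodes)]].add(expert)  # unique shard
    return placement

# Toy example: 8 experts, 2 of them "critical", spread over 3 nodes.
experts = list(range(8))
placement = place_experts(experts, critical={0, 1}, nodes=["a", "b", "c"])
```

Each node's shard overlaps every other node's on the critical experts, which is what lets each llama-server answer requests independently with zero cross-node traffic during inference.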
For the agent ecosystem, this is a missing piece. Agents built on open models (through Goose, Claude Code, or any OpenAI-compatible framework) can now access larger models than any single machine in the office could run. A team with three mediocre GPU boxes can collectively serve a model that previously required a single expensive rig.
Mesh LLM also includes a web console at localhost:3131 with live topology visualization, multi-model serving with request-based routing, and multimodal support including vision and audio. It runs as a background service on macOS and Linux, with Windows support coming.
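Request-based routing across multiple served models is conceptually simple: inspect the `model` field of the incoming request and forward it to a node that serves that model. A minimal sketch; the registry structure and node-selection policy here are assumptions, not Mesh LLM internals:

```python
def route(request: dict, model_registry: dict) -> str:
    """Pick a node address for the requested model.

    model_registry maps model name -> list of node addresses. A real
    router might pick the least-loaded node; this sketch just takes
    the first one that serves the model.
    """
    model = request.get("model")
    nodes = model_registry.get(model)
    if not nodes:
        raise ValueError(f"no node serves model {model!r}")
    return nodes[0]

# Hypothetical registry of models and the nodes serving them.
registry = {
    "qwen3": ["10.0.0.2:8080"],
    "mixtral": ["10.0.0.3:8080", "10.0.0.4:8080"],
}
target = route({"model": "mixtral", "messages": []}, registry)
```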
The project is open-source at github.com/michaelneale/mesh-llm with 537 stars. It was built as part of the Goose project to make open models accessible to people who don't have a single beefy machine but collectively have enough spare capacity.