Voicebox: The Open-Source ElevenLabs Killer
ElevenLabs charges you per character. Voicebox lets you clone any voice from a few seconds of audio, generate speech in 23 languages, and never send a byte to the cloud. All for free.
Voicebox is a local-first voice cloning studio built with Tauri, which uses Rust rather than Electron, a choice that matters for performance. It ships five TTS engines, including Alibaba's Qwen3-TTS, which achieves near-perfect voice cloning quality. You get post-processing effects (pitch shift, reverb, compression) plus a multi-track timeline editor for composing conversations and podcasts. There's even a REST API so you can pipe voice synthesis into your own apps.
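To make the REST API point concrete, here is a minimal sketch of how an app might call a local TTS server over HTTP using only the Python standard library. The base URL, port, `/generate` path, and JSON field names (`text`, `voice_id`, `language`) are all hypothetical placeholders, not Voicebox's documented API; check the project's own API docs for the real routes and schema.

```python
import json
import urllib.request

# Assumption: the base URL, port, endpoint path, and JSON field names below are
# hypothetical placeholders, not Voicebox's documented API.
BASE_URL = "http://127.0.0.1:8000"


def build_tts_request(text: str, voice_id: str, language: str = "en") -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for a local TTS server."""
    payload = {"text": text, "voice_id": voice_id, "language": language}
    return urllib.request.Request(
        url=f"{BASE_URL}/generate",  # hypothetical path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, voice_id: str, out_path: str, timeout: float = 30.0) -> None:
    """Send the request and save the response body, assumed here to be raw audio bytes."""
    request = build_tts_request(text, voice_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
```

With a server running locally, something like `synthesize("Hello there", "narrator", "hello.wav")` would write the returned audio to disk. Keeping request construction separate from sending makes the payload easy to inspect without touching the network.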
The real story here is what this enables for agents. Voice is the most natural interface for human-agent interaction, and high-quality local TTS removes the two biggest bottlenecks: latency and privacy. An agent that can speak in any voice, in 23 languages, with sub-second response times, running entirely on your machine, offers a fundamentally different UX from one that waits on a cloud API round-trip.
Voicebox runs on macOS (MLX/Metal), Windows (CUDA), and Linux, with support for AMD ROCm, Intel Arc, and Docker. The project has 16K stars on GitHub and is gaining 652 per day. The architecture is clean: React + TypeScript frontend, FastAPI backend, SQLite for state.
https://github.com/jamiepine/voicebox