April 6, 2026 · Agents · Open Source · Tool

Parlor: Your MacBook Is Now a Real-Time Multimodal Agent

Google drops Gemma 4 E2B. Two days later, someone builds a fully local multimodal AI that sees, hears, and talks back, running entirely on a MacBook Pro.

Parlor is dead simple in concept: you open a browser tab, grant camera and mic access, and start talking. The AI watches your camera feed, listens to your voice, thinks about both, and responds with synthesized speech. No cloud. No API keys. No server costs. Everything runs on your machine.

The stack is three models working together. Gemma 4 E2B handles speech understanding and vision through LiteRT-LM. Kokoro does the text-to-speech (MLX on macOS, ONNX on Linux). Silero VAD runs in the browser for voice activity detection. A FastAPI server ties it all together over WebSocket.
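To make the orchestration concrete, here is a minimal sketch of one conversational turn on the server side. The function names and stubs are assumptions for illustration, not Parlor's actual API: `understand` stands in for the LiteRT-LM multimodal call, `generate` for response generation, and `synthesize` for Kokoro TTS.

```python
import asyncio

# Hypothetical stubs standing in for the real models (assumed names,
# not Parlor's actual API):
#   understand  -> Gemma 4 E2B via LiteRT-LM (speech + vision)
#   generate    -> response generation on the same model
#   synthesize  -> Kokoro text-to-speech

async def understand(audio: bytes, frame: bytes) -> str:
    # Placeholder for the multimodal understanding call.
    await asyncio.sleep(0)
    return "user asked about the object on camera"

async def generate(context: str) -> str:
    # Placeholder for response generation.
    await asyncio.sleep(0)
    return f"Response to: {context}"

async def synthesize(text: str) -> bytes:
    # Placeholder for TTS; the real system streams audio back
    # to the browser over the WebSocket.
    await asyncio.sleep(0)
    return text.encode()

async def handle_turn(audio: bytes, frame: bytes) -> bytes:
    """One turn: understand -> generate -> speak."""
    context = await understand(audio, frame)
    reply = await generate(context)
    return await synthesize(reply)

if __name__ == "__main__":
    print(asyncio.run(handle_turn(b"<audio>", b"<frame>")))
```

In the real system, the browser-side Silero VAD decides when a turn starts and ends, so the server only ever sees complete utterances plus the current camera frame.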

The numbers: on an M3 Pro, speech and vision processing takes 1.8-2.2 seconds, response generation about 0.3 seconds, and TTS another 0.3-0.7 seconds. End-to-end latency is 2.5-3.0 seconds. That's not instant, but it's conversational. You can even interrupt the AI mid-response.
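Summing the reported stage budgets gives a 2.4-3.2 second range, which brackets the quoted 2.5-3.0 second end-to-end figure. A quick sanity check of that arithmetic:

```python
# Per-stage latency budgets on an M3 Pro, as reported (seconds).
stages = {
    "speech + vision understanding": (1.8, 2.2),
    "response generation": (0.3, 0.3),
    "text-to-speech": (0.3, 0.7),
}

low = round(sum(lo for lo, _ in stages.values()), 1)
high = round(sum(hi for _, hi in stages.values()), 1)
print(f"end-to-end budget: {low}-{high}s")  # → end-to-end budget: 2.4-3.2s
```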

The whole footprint is about 2.6GB: the Gemma model plus the TTS models. Python 3.12, Apache 2.0 license. 304 stars in its first day, and 26 forks already.

Why this matters: six months ago, running a multimodal agent locally was a research project. Now it's a weekend hack. The gap between cloud-only and local-first agent capabilities is collapsing faster than anyone expected. Every Mac with Apple Silicon is becoming an agent runtime, and Parlor is the clearest proof yet.

https://github.com/fikrikarim/parlor
