March 31, 2026 · Open Source · Agents · Infrastructure

Qwen3.5-Omni: The Agent That Sees, Hears, and Speaks 36 Languages

Alibaba's Qwen team dropped Qwen3.5-Omni on March 30, and it's not a minor upgrade. This is a fully multimodal model that processes text, images, audio, and video simultaneously — and talks back in real time across 36 languages. Speech recognition covers 113 languages and dialects, up from 19 in the previous version. The model ships in three sizes: Plus, Flash, and Light, all supporting 256K-token context. It was trained on over 100 million hours of audio-visual data.

The architecture is dual-component: a Thinker that handles reasoning across all input modalities using a Hybrid-Attention MoE design, and a Talker that converts the Thinker's output representations into streaming speech tokens. It's not text-to-speech stapled onto an LLM; both components work natively end-to-end. Qwen claims 215 SOTA results across benchmarks, outperforming Gemini 2.5 Pro and GPT-4o on 32 of 36 audio and audio-visual tasks.
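To make the two-stage flow concrete, here is a minimal illustrative sketch, not Qwen's actual implementation: a stub Thinker fuses whatever modalities are present into one representation, and a stub Talker streams speech tokens from it incrementally, so playback can begin before the full reply exists. All names and shapes here are assumptions for illustration only.

```python
# Illustrative Thinker/Talker sketch (assumed structure, not Qwen's code).
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class ThinkerOutput:
    text: str            # the reasoning result, as text
    hidden: List[float]  # stand-in for fused multimodal hidden states


def thinker(text: Optional[str] = None, image: Optional[str] = None,
            audio: Optional[str] = None, video: Optional[str] = None) -> ThinkerOutput:
    """Fuse whichever modalities are present into one representation (stub)."""
    parts = [m for m in (text, image, audio, video) if m is not None]
    fused = " + ".join(parts)
    return ThinkerOutput(text=f"reply to [{fused}]", hidden=[0.0] * 8)


def talker(out: ThinkerOutput) -> Iterator[str]:
    """Stream speech tokens from the Thinker's representation (stub).

    Tokens are yielded one at a time, mimicking streaming synthesis:
    audio playback can start before generation finishes.
    """
    for word in out.text.split():
        yield f"<speech:{word}>"


# End-to-end: both stages run as one pipeline, no separate TTS system.
rep = thinker(text="hello", audio="mic-frame")
tokens = list(talker(rep))
```

The design point the sketch captures is that the Talker consumes the Thinker's internal representation directly rather than re-synthesizing from final text, which is what distinguishes a native end-to-end model from an LLM with TTS bolted on.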

Two features stand out for agent builders. First, semantic interruption: the model can distinguish between "uh-huh" and an actual attempt to interrupt, so voice agents don't stop mid-thought every time someone coughs. Second, native tool use during audio — the model can search the web, call APIs, and execute functions while maintaining a voice conversation. That's the voice agent trifecta: see, hear, act.
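The interruption behavior is easy to picture as an agent loop. Below is a hypothetical sketch of semantic interruption handling; the keyword classifier is a trivial stub standing in for the model's actual semantic judgment, and every name here is invented for illustration.

```python
# Hypothetical voice-agent loop with semantic interruption handling.
# The real model classifies interruptions from audio semantics; this
# stub uses a keyword set purely to make the control flow visible.
BACKCHANNELS = {"uh-huh", "mm-hmm", "yeah", "right", "okay"}


def is_interruption(utterance: str) -> bool:
    """Backchannel acknowledgements don't interrupt; anything else does."""
    return utterance.strip().lower() not in BACKCHANNELS


def agent_turn(reply_words, user_events):
    """Speak reply_words, yielding the floor only on a real interruption.

    user_events maps word index -> what the user said at that moment.
    Returns (words actually spoken, the interrupting utterance or None).
    """
    spoken = []
    for i, word in enumerate(reply_words):
        heard = user_events.get(i)
        if heard is not None and is_interruption(heard):
            return spoken, heard   # stop mid-reply, hand over the floor
        spoken.append(word)        # backchannels don't stop the agent
    return spoken, None


# "uh-huh" at word 1 is ignored; "wait, stop" at word 3 interrupts.
spoken, cause = agent_turn(
    ["the", "model", "supports", "tool", "use"],
    {1: "uh-huh", 3: "wait, stop"},
)
```

In a real deployment the same loop would also be where tool calls fire: because the model keeps reasoning while speaking, a web search or API call can run between speech tokens without ending the turn.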

The model is fully open source on GitHub and already available on Ollama. For anyone building multimodal agents, this is the most capable open-source option available right now. The fact that Ollama's MLX integration launched the same day with Qwen3.5-35B as the showcase model is not a coincidence.

https://qwen.ai/blog?id=qwen3.5
https://github.com/QwenLM/Qwen3-Omni