GLM-5V-Turbo Bets Multimodal Is the Default
The GLM-V Team dropped GLM-5V-Turbo on arXiv (2604.26752) and HN picked it up at 86 points today. The thesis in one line: multimodal perception is not an add-on to language reasoning, it is the core. The team treats vision-language as a single substrate for reasoning, planning, tool use, and execution — not a vision encoder bolted onto a chat model.
The training stack is the structurally interesting part. They scaled multimodal RL alongside multimodal pre-training rather than after it, and built an extended toolchain so the model can call vision tools as first-class citizens. The result is strong scores on multimodal coding, visual tool use, and framework-based agentic tasks while preserving competitive text-only coding capability; usually that trade-off goes the other way. Held-out evals across heterogeneous contexts (images, video, webpages, documents, GUIs) show the same pattern.
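To make "vision tools as first-class citizens" concrete, here is a minimal sketch of an agent-style dispatch where a vision tool and a text tool live in the same registry. Everything below (the tool names, the ToolCall shape, the dispatch function) is illustrative and assumed for the sketch, not the paper's actual toolchain or API.

    # Hypothetical sketch: vision and text tools in one registry.
    # Names and interfaces are illustrative, not GLM-V's actual toolchain.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class ToolCall:
        name: str
        args: dict

    def crop_region(image_path: str, box: list[int]) -> str:
        # Illustrative vision tool: return a reference to a cropped sub-image.
        return f"{image_path}#crop={','.join(map(str, box))}"

    def run_python(code: str) -> str:
        # Illustrative text tool: pretend to execute a code snippet.
        return f"executed {len(code)} chars"

    # Vision and text tools share one registry, so the model can pick either
    # mid-task instead of routing images through a separate preprocessing stage.
    TOOLS: dict[str, Callable[..., str]] = {
        "crop_region": crop_region,
        "run_python": run_python,
    }

    def agent_step(call: ToolCall) -> str:
        # Dispatch whatever tool call the model emitted, vision or text alike.
        return TOOLS[call.name](**call.args)

    if __name__ == "__main__":
        # e.g. the model asks to zoom into a chart before answering a question.
        print(agent_step(ToolCall("crop_region", {"image_path": "chart.png", "box": [10, 10, 200, 120]})))
        print(agent_step(ToolCall("run_python", {"code": "print(2 + 2)"})))

The point of that shape, as the paper's framing suggests, is that cropping a screenshot goes through the same call path as running code: a tool call the model learned to make, not a preprocessing step bolted on before the language model sees anything.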
This is the GLM-V team, the same group behind GLM-5V earlier this year. Wenyi Hong leads, 76 contributors. They've been quietly shipping every quarter while everyone else hypes Gemini. GLM-5V hit FutureX leaderboards two months ago. GLM-5V-Turbo is the ablated, faster, more deployable version. Open weights are expected based on prior releases, unlike Gemini Robotics ER, which is closed.
This matters for the agent thesis. Computer-use agents need to read screens, parse documents, and watch videos, and pure-text frontier models hit a ceiling on long-horizon GUI tasks. GLM-5V-Turbo's pitch of multimodal-as-foundation, with tool use baked into training rather than stitched in via prompting, is the same bet Manus's My Computer made on the systems side and MolmoAct2 made on the embodied side. Three teams converging on the same architectural argument inside two weeks. The "language model with vision adapter" era is closing.
Paper: arxiv.org/abs/2604.26752