April 11, 2026 · Agents · Open Source · Research · Tool

MolmoWeb: AI2 Open-Sources a Web Agent That Sees Like You Do

AI2 just dropped MolmoWeb, and it changes the game for open-source web agents. It's an 8B model that operates a browser by looking at screenshots — the same way you do — and it outperforms agents built on GPT-4o on key web navigation tasks. No DOM parsing, no accessibility trees. Pure vision.

The model comes in 4B and 8B sizes, built on the Molmo 2 multimodal family. Given a task and a live webpage, it observes the page through screenshots, predicts the next step, and executes browser actions — clicking, typing, scrolling. The 8B version scores 78.2% on WebVoyager and 42.3% on DeepShop.

But the real gift is the data. MolmoWebMix includes 30,000 human task trajectories across 1,100+ websites, 590,000 subtask demonstrations, and 2.2 million screenshot QA pairs. AI2 calls it the largest publicly released collection of human web-task execution ever assembled. That dataset alone could spawn a dozen research projects.

Critically, MolmoWeb was trained without distilling from proprietary vision agents. The training data comes from synthetic trajectories generated by text-only accessibility-tree agents plus real human demonstrations. This means no legal gray areas around training on closed-model outputs.

Everything is open: weights, training code, eval harness, annotation tool, synthetic data pipeline, and demo client. If you believe the future of AI agents includes operating websites and apps through vision, this is the most complete open-source starting point that exists.

https://github.com/allenai/molmoweb
