MolmoWeb: AI2 Open-Sources a Web Agent That Sees Like You Do
AI2 just dropped MolmoWeb, and it changes the game for open-source web agents: an 8B model that operates a browser by looking at screenshots — the same way you do — and outperforms agents built on GPT-4o on key web-navigation tasks. No DOM parsing, no accessibility trees. Pure vision.
The model comes in 4B and 8B sizes, built on the Molmo 2 multimodal family. Given a task and a live webpage, it observes the page through screenshots, predicts the next step, and executes browser actions — clicking, typing, scrolling. The 8B version scores 78.2% on WebVoyager and 42.3% on DeepShop.
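The observe–predict–act loop described above can be sketched in a few lines of Python. Everything here is illustrative: the action grammar, the function names (`parse_action`, `run_episode`), and the stubbed model and browser interfaces are assumptions, not the actual MolmoWeb API.

```python
import re
from dataclasses import dataclass

@dataclass
class Action:
    kind: str     # e.g. "click", "type", "scroll", "done" (hypothetical grammar)
    args: tuple

def parse_action(text: str) -> Action:
    """Parse a model-emitted action string such as 'click(120, 340)'.

    The textual action format is invented for this sketch; the real
    model's output schema may differ.
    """
    m = re.fullmatch(r"(\w+)\((.*)\)", text.strip())
    if not m:
        raise ValueError(f"unparseable action: {text!r}")
    kind, raw = m.group(1), m.group(2)
    args = tuple(a.strip().strip("'\"") for a in raw.split(",")) if raw else ()
    return Action(kind, args)

def run_episode(task: str, browser, model, max_steps: int = 20) -> bool:
    """Observe (screenshot) -> predict (model call) -> act (browser), until done.

    `browser` and `model` are placeholders: the browser exposes only pixels
    (screenshot) and low-level actions -- no DOM, no accessibility tree.
    """
    for _ in range(max_steps):
        shot = browser.screenshot()                # raw pixels only
        action = parse_action(model(task, shot))   # model predicts next step
        if action.kind == "done":
            return True
        browser.execute(action)                    # click / type / scroll
    return False
```

In a real deployment the browser side would be a driver like Playwright or Selenium and the model call would go to the MolmoWeb checkpoint; the point of the sketch is only the shape of the loop, where the sole observation channel is the screenshot.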
But the real gift is the data. MolmoWebMix includes 30,000 human task trajectories across 1,100+ websites, 590,000 subtask demonstrations, and 2.2 million screenshot QA pairs. AI2 calls it the largest publicly released collection of human web-task execution ever assembled. That dataset alone could spawn a dozen research projects.
Critically, MolmoWeb was trained without distilling from proprietary vision agents. The training data comes from synthetic trajectories generated by text-only accessibility-tree agents plus real human demonstrations. This means no legal gray areas around training on closed-model outputs.
Everything is open: weights, training code, eval harness, annotation tool, synthetic data pipeline, and demo client. If you believe the future of AI agents includes operating websites and apps through vision, this is the most complete open-source starting point that exists.
https://github.com/allenai/molmoweb