ClawBench: Agents Score 33% on Tasks You Do Every Day
Here is a number that should worry anyone building AI agents for consumers: Claude Sonnet 4.6, the best-performing model, scores only 33.3% on ClawBench, a benchmark of 153 everyday online tasks across 144 real websites. GPT-5.4 manages just 6.5%.
ClawBench is different from existing benchmarks because it runs on live production websites, not sandboxed simulations with static HTML. The tasks are things normal people do regularly: purchasing products, booking appointments, submitting job applications, filling out detailed forms. The benchmark uses a lightweight interception layer that captures and blocks only the final submission request, so agents interact with real sites without actually completing transactions.
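The paper's actual harness isn't detailed here, but the idea is straightforward to sketch. Below is a minimal illustration of such an interception layer using Playwright; the `/checkout` URL pattern and the fake success response are hypothetical stand-ins, not ClawBench's real endpoints or payloads:

```python
# Minimal sketch of a submission-interception layer (assumes Playwright).
# The URL pattern and success body are illustrative, not ClawBench's actual values.
from playwright.sync_api import sync_playwright, Route

captured = []  # final submission payloads, saved for later grading

def intercept(route: Route) -> None:
    req = route.request
    # Block only the final state-changing submission; let everything else
    # through so the agent sees the real site (search, navigation, validation).
    if req.method == "POST" and "/checkout" in req.url:  # hypothetical endpoint
        captured.append({"url": req.url, "body": req.post_data})
        # Return a fake success so the agent believes the task completed,
        # without the transaction ever reaching the production backend.
        route.fulfill(status=200, content_type="application/json",
                      body='{"ok": true}')
    else:
        route.continue_()

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.route("**/*", intercept)
    # ... the agent drives `page` here ...
    browser.close()
```

The design choice worth noting is how narrow the interception is: because only the final write is blocked, everything the agent perceives up to that point is the genuine production site, not a stale snapshot.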
The gap between benchmark scores and real-world performance is striking: these same models score 65-75% on traditional web benchmarks like WebVoyager. But drop them onto actual websites, with real forms, dynamic content, and multi-step workflows, and performance craters. The paper evaluates 7 frontier models, and none breaks 35%.
This matters because the industry is rushing to deploy web agents for consumers. ClawBench shows we are not ready. The next time someone tells you their agent can browse the web and get things done, ask them what their ClawBench score is.
https://claw-bench.com