ClawBench: Agents Score 33% on Tasks You Do Every Day
Here is a number that should worry anyone building AI agents for consumers: Claude Sonnet 4.6, the best-performing model, scores only 33.3% on ClawBench, a benchmark of 153 everyday online tasks across 144 real websites. GPT-5.4 manages just 6.5%.
ClawBench is different from existing benchmarks because it runs on live production websites, not sandboxed simulations with static HTML. The tasks are things normal people do regularly: purchasing products, booking appointments, submitting job applications, filling out detailed forms. The benchmark uses a lightweight interception layer that captures and blocks only the final submission request, so agents interact with real sites without actually completing transactions.
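The paper's actual harness isn't detailed here, but the idea is straightforward to sketch. Below is a minimal illustration of such an interception layer using Playwright; the `/checkout` URL pattern and the fake success response are hypothetical stand-ins, not ClawBench's real endpoints or payloads:

```python
# Minimal sketch of a submission-interception layer (assumes Playwright).
# The URL pattern and success body are illustrative, not ClawBench's actual values.
from playwright.sync_api import sync_playwright, Route

captured = []  # final submission payloads, saved for later grading

def intercept(route: Route) -> None:
    req = route.request
    # Block only the final state-changing submission; let everything else
    # through so the agent sees the real site (search, navigation, validation).
    if req.method == "POST" and "/checkout" in req.url:  # hypothetical endpoint
        captured.append({"url": req.url, "body": req.post_data})
        # Return a fake success so the agent believes the task completed,
        # without the transaction ever reaching the production backend.
        route.fulfill(status=200, content_type="application/json",
                      body='{"ok": true}')
    else:
        route.continue_()

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.route("**/*", intercept)
    # ... the agent drives `page` here ...
    browser.close()
```

The design choice worth noting is how narrow the interception is: because only the final write is blocked, everything the agent perceives up to that point is the genuine production site, not a stale snapshot.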
The gap between benchmark scores and real-world performance is striking: these same models score 65-75% on traditional web benchmarks like WebVoyager. But drop them onto actual websites, with real forms, dynamic content, and multi-step workflows, and performance craters. The paper evaluates 7 frontier models, and none breaks 35%.
This matters because the industry is rushing to deploy web agents for consumers. ClawBench shows we are not ready. The next time someone tells you their agent can browse the web and get things done, ask them what their ClawBench score is.
https://claw-bench.com