April 8, 2026 · Benchmark · Agents · Research

Claw-Eval: Finally, an Agent Benchmark That Doesn't Lie

Most agent benchmarks test whether a model can answer questions. Claw-Eval tests whether it can actually do things. And it's today's most upvoted paper on HuggingFace with 326 votes, which tells you the community was hungry for this.

Built by teams at Peking University and the University of Hong Kong, Claw-Eval throws 139 tasks across 15 real services at agents: calendar management, file operations, web search, code execution, financial analysis, email processing. Everything runs in Docker sandboxes for full reproducibility. The key innovation is Pass-cubed: a task only counts as passed if the agent succeeds across three independent trials. No more lucky shots inflating leaderboards.
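The Pass-cubed rule is simple to state in code. Here is a minimal sketch of the scoring logic as described above — a task counts only if all three independent trials succeed. The function name, task names, and trial data are illustrative, not taken from the Claw-Eval codebase:

```python
from typing import Dict, List

def pass_cubed(trial_results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks passed under Pass-cubed: a task counts as
    passed only if all three independent trials succeeded.
    (Illustrative sketch, not the benchmark's actual scoring code.)"""
    passed = sum(
        1 for trials in trial_results.values()
        if len(trials) == 3 and all(trials)
    )
    return passed / len(trial_results)

# Hypothetical trial outcomes for three tasks:
results = {
    "calendar_create_event": [True, True, True],   # passes
    "email_triage":          [True, False, True],  # one flaky run -> fails
    "web_search_summary":    [True, True, True],   # passes
}
print(f"Pass-cubed: {pass_cubed(results):.1%}")  # -> Pass-cubed: 66.7%
```

The point of the all-three requirement is exactly what the "email_triage" row shows: a model that gets lucky two runs out of three scores zero on that task, so flaky behavior can't inflate the leaderboard.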

Version 1.1.0, just released, adds 35 multimodal agentic tasks where agents have to perceive visual information, reason about it, and deliver results. Twenty-three models are already on the leaderboard, with Step 3.5 Flash and GLM-5 neck-and-neck at 70.2% Pass-at-3. The safety evaluation is especially interesting, with scores ranging from 93.3% down to much lower for some open-source models.

What makes this different from the dozen other agent benchmarks out there is the transparency commitment. Every task is human-verified. The codebase is being community-audited. They're not just publishing numbers, they're publishing the machinery that produces the numbers so anyone can verify.

Agent evaluation has been the weak link in the entire ecosystem. You can build the most sophisticated agent framework in the world, but if your benchmark lets mediocre models pass by luck, you're flying blind. Claw-Eval is the first benchmark that takes this problem seriously at scale.

https://github.com/claw-eval/claw-eval
