May 1, 2026 · Benchmark · Research · Agents

Claw-Eval-Live: A Benchmark That Refreshes With the Real World

A new agent benchmark called Claw-Eval-Live landed on May 1 with an obvious-but-rarely-done idea: agents in production face workflows that change, so the eval should change too. The setup: 105 tasks spanning business services and local workspace repair, 13 frontier models tested under shared public pass rules, and a refreshable signal layer that pulls in new public workflow demand across releases.

The top-line number is what makes the paper worth reading. The leading model passes 66.7% of tasks. No model crosses 70%. The persistent failure modes cluster around HR, management, and multi-system business workflows. Local workspace repair — the kind of single-app surgical fix coding agents are built for — is comparatively easy. The hard part isn't the per-tool reasoning. The hard part is the cross-system and cross-stakeholder coordination, exactly the part that humans actually get paid to do.

The methodology choice that makes this real: deterministic checks when the evidence is concrete, structured LLM judging only for the semantic dimensions, and evaluation grounded in execution traces, audit logs, service states, and workspace artifacts. Not just the final response. Verifiable agent action all the way down. This is the right pattern. Most existing benchmarks score the answer and skip the trail.
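
To make the pattern concrete, here is a minimal sketch of what that hybrid grading loop could look like. Everything in it is assumed for illustration: the `Trace` container and the `deterministic_checks`, `semantic_judgment`, and `score_task` functions are hypothetical names, not the paper's actual harness. The point is only the split the paper describes, where hard evidence (service state, workspace artifacts) gets exact checks and only the semantic rubric goes to an LLM judge that still sees the trace.

```python
# Minimal sketch of the hybrid grading pattern described above.
# All names here are hypothetical; this is not the Claw-Eval-Live harness.

from dataclasses import dataclass, field


@dataclass
class Trace:
    """Evidence collected from an agent run, not just its final reply."""
    final_response: str
    execution_log: list[str] = field(default_factory=list)      # tool/API calls
    audit_log: list[dict] = field(default_factory=list)         # service-side records
    service_state: dict = field(default_factory=dict)           # e.g. ticket status
    workspace_files: dict[str, str] = field(default_factory=dict)


def deterministic_checks(trace: Trace, expected_state: dict) -> bool:
    """Concrete evidence gets exact checks: did the state the task requires
    actually materialize in the service or workspace?"""
    return all(trace.service_state.get(k) == v for k, v in expected_state.items())


def semantic_judgment(trace: Trace, rubric: str, call_llm) -> bool:
    """Only dimensions that cannot be checked mechanically (tone, stakeholder-
    appropriate phrasing, summary accuracy) go to an LLM judge, and the judge
    is shown the trace, not just the final response."""
    prompt = (
        f"Rubric: {rubric}\n"
        f"Final response: {trace.final_response}\n"
        f"Execution log: {trace.execution_log}\n"
        "Answer PASS or FAIL."
    )
    return call_llm(prompt).strip().upper() == "PASS"


def score_task(trace: Trace, expected_state: dict, rubric: str, call_llm) -> bool:
    # A task passes only if both the verifiable state changes and the
    # semantic rubric hold; hard evidence is never delegated to the judge.
    return deterministic_checks(trace, expected_state) and \
        semantic_judgment(trace, rubric, call_llm)
```

The design choice worth copying is that final `and`: in this kind of setup the LLM judge can reject a run, but it can never rescue one whose verifiable state is wrong.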

Claw-Eval-Live is the third major eval paper in three weeks attacking the same wall — Synthetic Computers at Scale, WindowsWorld, now this. The pattern is clear: the evaluation crisis the field has been talking about for six months is becoming an actual research program with concrete deliverables. Worth tracking the leaderboard at claw-eval-live.github.io.

https://arxiv.org/abs/2604.28139