May 15, 2026 · Benchmark · Research · Agents

WildClawBench Says Claude Opus 4.7 Tops Out at 62.2%

InternLM dropped WildClawBench on arXiv (2605.10912) last week. The setup — 60 human-created bilingual multimodal tasks that run inside actual command-line harnesses, not synthetic agent sandboxes. 17 authors, code and containers released, sitting at 35 HF upvotes today.

The task design is the point. Each task takes roughly 8 minutes of wall-clock time and requires more than 20 tool calls. Six thematic categories. Tasks were built by humans, not template-generated, so they cover real-world long-horizon work — debugging a multi-language codebase, coordinating multi-file refactors, threading state across CLI sessions — instead of the canned scenarios that older benchmarks lean on.
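For a sense of what "runs inside an actual command-line harness" implies, here is a minimal sketch of a task spec in that style. The field names, defaults, and check mechanism are my assumptions for illustration, not WildClawBench's actual schema; only the budgets mirror the numbers quoted above.

```python
from dataclasses import dataclass, field

@dataclass
class CliTask:
    """Hypothetical long-horizon CLI task, loosely modeled on the paper's description."""
    task_id: str
    category: str                      # one of the six thematic categories
    languages: tuple                   # bilingual tasks, e.g. ("en", "zh")
    wall_clock_budget_s: int = 480     # roughly 8 minutes of wall-clock time per task
    min_tool_calls: int = 20           # tasks require more than 20 tool calls
    setup_cmds: list = field(default_factory=list)  # container / repo setup steps
    check_cmd: str = ""                # command whose exit code decides completion

example = CliTask(
    task_id="refactor-003",
    category="multi-file refactor",
    languages=("en", "zh"),
    setup_cmds=["git clone /fixtures/repo.git workdir"],
    check_cmd="cd workdir && pytest -q",
)
```

The important design choice is the last field: completion is judged by running a check inside the same container the agent worked in, not by grading a transcript.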

The results sting. Of the 19 state-of-the-art models tested, Claude Opus 4.7 leads at a 62.2% completion rate; every other model on the leaderboard sits below 60%. The harness sensitivity finding is even more uncomfortable: swapping the agent harness alone, with the same model underneath, can shift a single model's score by up to 18 percentage points. That is bigger than the gap between the top three models on most benchmarks.
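The harness-sensitivity number is straightforward to operationalize: run the same model through each harness and look at the spread. A rough sketch, assuming you have some benchmark entry point; `run_benchmark` and the harness names are placeholders, not part of the released eval kit.

```python
# Sketch: quantify how much the harness alone moves one model's score.
# `run_benchmark(model=..., harness=...)` is a stand-in returning a completion rate in [0, 1].

def harness_sensitivity(model: str, harnesses: list, run_benchmark) -> float:
    """Return max-minus-min completion rate for one model across harnesses."""
    scores = {h: run_benchmark(model=model, harness=h) for h in harnesses}
    spread = max(scores.values()) - min(scores.values())
    for h, s in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{model} / {h}: {s:.1%}")
    print(f"harness spread for {model}: {spread:.1%}")  # the paper reports spreads up to ~18 points
    return spread
```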

What this confirms: agent benchmarks built inside the actual runtime where agents are deployed produce systematically lower numbers than synthetic environments do. The gap between 'agent reasoning quality' (high on toy benchmarks) and 'agent task completion' (moderate on WildClawBench) is where the entire 2026 agent infrastructure category lives. It pairs structurally with EVA-Bench from the morning run on the voice side: both name a real category-defining number that the current production stack cannot pass. The eval kit is at github.com/internlm/WildClawBench.