April 14, 2026 · Benchmark · Agents · Research

CocoaBench: The Best AI Agent Scores 45%. That's the Best Anyone Does.

Most agent benchmarks test one thing at a time. Can the agent browse? Can it code? Can it use tools? CocoaBench asks a harder question: can it do all of them together on real tasks that humans actually care about?

The benchmark comes from a team of 30+ researchers and is built from human-designed, long-horizon tasks that require agents to flexibly combine vision, web search, and coding. No hand-holding. Each task is just an instruction and an automatic evaluation function. The agent figures out the rest.
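To make the setup concrete, here is a minimal sketch of what "an instruction and an automatic evaluation function" could look like as a harness. All names (`Task`, `run_benchmark`, the toy task) are hypothetical illustrations, not the paper's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    """Hypothetical task shape: just an instruction plus an automatic checker."""
    instruction: str
    evaluate: Callable[[str], bool]  # scores the agent's final output, no hand-holding

def run_benchmark(tasks: list[Task], agent: Callable[[str], str]) -> float:
    """Return the overall success rate of `agent` across `tasks`."""
    passed = sum(task.evaluate(agent(task.instruction)) for task in tasks)
    return passed / len(tasks)

# Toy example: one task, one trivial agent.
tasks = [Task("Reply with exactly: OK", lambda out: out.strip() == "OK")]
print(run_benchmark(tasks, lambda instruction: "OK"))  # 1.0
```

The point of the design is that the checker sees only the agent's final output, so any mix of vision, search, and coding the agent uses along the way is its own business.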

The results are sobering. The best system evaluated hit a 45.1% success rate. Not 45% on the hard tasks. 45% overall. Current agents still fail more than they succeed on tasks that require combining multiple capabilities in open environments.

The paper also ships Cocoa-Agent, a lightweight shared scaffold that enables controlled comparisons across different model backbones. The accompanying analysis breaks down where agents fail: reasoning and planning, tool use and execution, and visual grounding are all significant bottlenecks.

This is exactly the kind of benchmark the field needs. Single-capability tests make agents look better than they are. CocoaBench shows where we actually stand when the training wheels come off.

https://arxiv.org/abs/2604.11201