May 12, 2026 · Research · Benchmark · MCP

ComplexMCP Tests Agents on 300 Tools Across 7 Stateful Sandboxes. The Best LLM Still Loses to a Human.

ComplexMCP dropped on arXiv yesterday, from Yuanyang Li, Xue Yang, Longyue Wang, Weihua Luo, and Hongyang Chen. The framing: every existing MCP benchmark evaluates isolated API calls on toy fixtures, while real commercial automation requires agents to use 300+ tools across multiple stateful environments where actions have downstream consequences and APIs occasionally fail in plausible ways. ComplexMCP is the benchmark that builds that.

The specs. Over 300 tools spanning seven different stateful sandboxes designed to mimic real commercial software environments. Seed-driven architecture so each evaluation run uses a deterministic but non-trivial environment state. Built-in API failure injection so the agent has to recover, not just retry. Cross-sandbox tasks so the agent has to track state and dependencies across multiple tool surfaces.
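To make the architecture concrete, here is a minimal Python sketch of what a seed-driven stateful sandbox with failure injection could look like. Every name in it is hypothetical, not from the paper; the point is only the shape: same seed yields the same non-trivial initial state, actions mutate that state, and a fraction of calls fail plausibly so the agent has to recover.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Sandbox:
    """Hypothetical seed-driven sandbox; tool and field names are illustrative."""
    seed: int
    fail_rate: float = 0.05  # fraction of calls that fail in a plausible way
    state: dict = field(default_factory=dict)

    def __post_init__(self):
        rng = random.Random(self.seed)  # deterministic but non-trivial state
        self.state["orders"] = {
            f"ORD-{i}": {"status": rng.choice(["open", "shipped"])}
            for i in range(rng.randint(20, 50))
        }
        self._rng = rng

    def call(self, tool: str, **args) -> dict:
        # Failure injection: the agent must recover, not just retry blindly.
        if self._rng.random() < self.fail_rate:
            return {"error": "503 upstream timeout"}
        handler = getattr(self, f"_tool_{tool}", None)
        if handler is None:
            return {"error": f"unknown tool: {tool}"}
        return handler(**args)

    def _tool_get_order(self, order_id: str) -> dict:
        return self.state["orders"].get(order_id, {"error": "no such order"})

    def _tool_cancel_order(self, order_id: str) -> dict:
        order = self.state["orders"].get(order_id)
        if order is None:
            return {"error": "no such order"}
        if order["status"] != "open":
            return {"error": f"{order_id} already {order['status']}"}
        order["status"] = "cancelled"  # actions have downstream consequences
        return {"ok": True}

# Same seed -> identical initial state -> reproducible evaluation runs.
assert Sandbox(seed=42).state == Sandbox(seed=42).state
```

A cross-sandbox task would then hand the agent several of these objects at once and score the final environment state, not individual calls.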

The headline number is brutal. Top LLM success rate is below 60%; human performance is around 90%. The roughly 30-point gap does not come from a shortage of raw reasoning, since frontier models can solve each step in isolation. It comes from the interaction with environment state, which is exactly where toy benchmarks have been hiding the problem.

The diagnostic part is the structural contribution. The paper identifies three concrete bottlenecks. Tool retrieval saturation: as the action space scales past 100 tools, agents increasingly pick semantically similar but functionally wrong tools. Agent over-confidence: agents skip environment verification, checking state before acting, and instead act on stale assumptions. Strategic defeatism: instead of recovering from a failed step, agents rationalize the failure into a partial success and stop trying. None of these show up cleanly in single-tool benchmarks; a sketch of the missing discipline follows.
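The over-confidence and defeatism failures point at one concrete habit: read state before mutating it, and treat an injected failure as something to recover from. A hedged sketch, reusing the hypothetical Sandbox from the snippet above (the helper names are mine, not the paper's):

```python
def cancel_order_overconfident(sandbox: Sandbox, order_id: str) -> dict:
    # Acts on a stale assumption: the order may have shipped since the agent
    # last looked, so this mutation can silently do the wrong thing.
    return sandbox.call("cancel_order", order_id=order_id)

def cancel_order_verified(sandbox: Sandbox, order_id: str) -> dict:
    # Verify current environment state before acting.
    order = sandbox.call("get_order", order_id=order_id)
    if "error" in order:
        # Recover from an injected failure instead of rationalizing it into
        # a partial success ("strategic defeatism"): one re-read, then decide.
        order = sandbox.call("get_order", order_id=order_id)
        if "error" in order:
            return {"failed": order["error"]}
    if order.get("status") != "open":
        return {"skipped": f"{order_id} is {order['status']}, not cancellable"}
    return sandbox.call("cancel_order", order_id=order_id)
```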

Where this fits in the agent-eval cluster. SREGym, DELEGATE-52, Tool-Use Tax, LongSeeker, Instrumental Choices, PrefixGuard: the past 30 days have produced six structurally different production-reliability benchmarks. ComplexMCP slots in as the MCP-native one, scoping the eval directly to the protocol that has become the canonical agent-tool interface. Top LLMs below 60% on a 300-tool MCP sandbox is the number that should govern enterprise rollout plans for the next six months. arxiv.org/abs/2605.10787.