June 23, 2026 | Research, Benchmark, Agents

PlanBench-XL breaks agents by breaking their tools

PlanBench-XL is the top paper on HuggingFace today, and it goes after the assumption underneath every agent demo: that the tools will be there when the agent reaches for them. The benchmark drops an agent into 327 retail tasks spread across 1,665 tools, where it has to discover which tools it needs, call them to uncover clues for the next step, and plan over a long horizon, all under limited tool visibility, the way real systems actually work.

Then comes the cruel part. PlanBench-XL has an optional blocking mode that makes tools go missing, fail, or actively distract, simulating the messiness of production. And the numbers crater. GPT-5.4 hits 51.90% in clean conditions, which already isn't great, then collapses to 11.36% under the harshest blocking. The planning ability that looked fine in the demo evaporates the moment the environment stops cooperating.
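To make the blocking idea concrete, here is a minimal sketch of what "missing, fail, or distract" could look like around a tool registry. Everything in it is an assumption for illustration: `make_blocked_registry`, `ToolUnavailable`, `relative_drop`, and the three mode names are hypothetical and are not taken from the PlanBench-XL code.

```python
from typing import Callable, Dict, Iterable


class ToolUnavailable(Exception):
    """Raised by a tool that has been switched into the 'fail' mode (hypothetical)."""


def make_blocked_registry(
    tools: Dict[str, Callable],
    blocked: Iterable[str],
    mode: str,
) -> Dict[str, Callable]:
    """Return a copy of `tools` with the named tools disrupted.

    mode is one of (names are assumptions, not the benchmark's own):
      "missing"  - the tool disappears from the registry
      "fail"     - the tool is present but raises when called
      "distract" - the tool is present but returns an unhelpful answer
    """
    registry = dict(tools)
    for name in blocked:
        if mode == "missing":
            registry.pop(name, None)
        elif mode == "fail":
            # Bind the name as a default so each closure keeps its own value.
            def failing(*args, _name=name, **kwargs):
                raise ToolUnavailable(f"{_name} is unavailable")
            registry[name] = failing
        elif mode == "distract":
            registry[name] = lambda *args, **kwargs: "irrelevant result"
        else:
            raise ValueError(f"unknown blocking mode: {mode}")
    return registry


def relative_drop(clean: float, blocked: float) -> float:
    """Fraction of the clean score that is lost under blocking."""
    return (clean - blocked) / clean


# Toy usage: one retail-style tool, blocked three different ways.
tools = {"lookup_order": lambda order_id: f"order {order_id}: shipped"}
missing = make_blocked_registry(tools, ["lookup_order"], "missing")
failing = make_blocked_registry(tools, ["lookup_order"], "fail")
distracting = make_blocked_registry(tools, ["lookup_order"], "distract")
```

The same helper puts the headline numbers in perspective: going from 51.90% to 11.36% is a drop of 40.54 percentage points, or roughly 78% of the clean score, which is what `relative_drop(51.90, 11.36)` computes.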

This is the reality-check genre that's been quietly stacking up all spring: Agents' Last Exam, Where Do Deep-Research Agents Go Wrong, AdaPlanBench. The throughline is consistent and uncomfortable: agents look competent on clean benchmarks and fall apart on disruption, adaptation, and recovery. The hard part of agent work was never the happy path. It's what happens when a tool 404s mid-task.

Worth pairing this with Sakana Fugu, which landed the same week. The paper says orchestration collapses under tool failure; the product says orchestration is the path to the frontier. Both can't be fully right. The paper is arXiv 2606.22388, with code at github.com/JiayuJeff/PlanBench-XL.