June 5, 2026 | Research | Benchmark | Agents

AdaPlanBench Shows Agents Still Can't Re-Plan When the Rules Change Mid-Task

There is a quiet gap in how we test planning agents. Most benchmarks hand the agent every constraint up front, let it make a plan, and grade the plan. Real life does not work that way: you start cooking and discover you are out of an ingredient. AdaPlanBench, a new benchmark out this week, tests exactly that: can an agent re-plan when constraints are revealed only after it commits to a plan?

The setup is clever. It is built on 307 household tasks, each augmented with hidden dual constraints. The agent interacts with the environment over multiple turns, and a hidden constraint only surfaces when the agent proposes a plan that violates it. So the agent has to notice it just broke a rule it did not know existed, and revise, again and again, as the feedback piles up. That is a much harsher and much more realistic test than one-shot planning.
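To make the interaction pattern concrete, here is a minimal sketch of that reveal-on-violation loop. It is not AdaPlanBench's actual code or API: every name in it (`HiddenConstraint`, `Environment`, `run_episode`, `toy_agent`) and the tea-making task are hypothetical, chosen only to illustrate how a constraint stays invisible until a proposed plan breaks it.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Plan = List[str]


@dataclass
class HiddenConstraint:
    """A rule the agent cannot see until one of its plans violates it (hypothetical)."""
    description: str
    violated_by: Callable[[Plan], bool]


@dataclass
class Environment:
    """Toy environment: holds hidden constraints and tracks which were revealed."""
    hidden: List[HiddenConstraint]
    revealed: List[str] = field(default_factory=list)

    def check(self, plan: Plan) -> Optional[str]:
        """Return the first violated constraint's description, or None if the plan passes."""
        for constraint in self.hidden:
            if constraint.violated_by(plan):
                if constraint.description not in self.revealed:
                    self.revealed.append(constraint.description)
                return constraint.description
        return None


def run_episode(
    env: Environment,
    propose: Callable[[List[str]], Plan],
    max_turns: int = 5,
) -> Tuple[Optional[Plan], int]:
    """Let the agent propose plans until one passes or the turn budget runs out."""
    for turn in range(1, max_turns + 1):
        plan = propose(env.revealed)  # the agent sees only what has surfaced so far
        if env.check(plan) is None:
            return plan, turn
    return None, max_turns


def toy_agent(revealed: List[str]) -> Plan:
    """Stand-in for an LLM: starts from a default plan and patches it per revealed rule."""
    plan = ["boil water in kettle", "steep tea", "add sugar"]
    if "the kettle is broken" in revealed:
        plan[0] = "boil water in pot"
    if "the user wants no sugar" in revealed:
        plan.remove("add sugar")
    return plan


def make_tea_env() -> Environment:
    """Build a fresh two-constraint example task (illustrative only)."""
    return Environment(hidden=[
        HiddenConstraint("the kettle is broken",
                         lambda p: "boil water in kettle" in p),
        HiddenConstraint("the user wants no sugar",
                         lambda p: "add sugar" in p),
    ])


if __name__ == "__main__":
    env = make_tea_env()
    final_plan, turns = run_episode(env, toy_agent)
    print(final_plan, turns, env.revealed)
```

In this toy version the agent needs one extra turn per hidden constraint, because each rule surfaces only after a plan trips it. A real agent has the harder job of revising without breaking the rules it has already learned, which is where the accumulating feedback starts to bite.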

The results are sobering. Across ten leading LLMs, the best model reached only 67.75% accuracy, and performance steadily degrades as constraints accumulate. The agents struggle most with user constraints, and the failures trace back to weak physical grounding: the model does not really understand the world it is acting in. This is the kind of finding that should make anyone shipping a long-horizon agent nervous.

The takeaway is blunt. Static planning benchmarks have been flattering our agents. The moment you make the environment reveal surprises mid-task, the way the real world always does, the scores fall off a cliff. Adaptive re-planning under accumulating constraints is an unsolved problem, and now there is a clean way to measure it. Paper: https://arxiv.org/abs/2606.05622
← Previous
Google Shrinks Gemma 4 to Run on Your Phone Without Wrecking It
Next β†’
Agent Browser Shield Puts a Filter Between Your Agent and the Web's Traps
← Back to all articles

Comments

Loading...
>_