May 16, 2026 · Research · Benchmark · Agents

A New Way to Grade Agents That Beats Frontier Models as Judges

Holistic Evaluation and Failure Diagnosis of AI Agents landed on arXiv on May 14 (2605.14865). Fifteen authors. The paper takes apart the monolithic-LLM-as-judge approach to evaluating agent runs and proposes a framework that decomposes traces into span-level assessments with an explicit rationale for each verdict. The output is not just pass/fail but where the failure occurred and why.
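To make the decomposition concrete, here is a minimal sketch of what span-level judging could look like. The names (Span, SpanVerdict, judge_trace) are illustrative assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical shapes for a decomposed, span-level judgment.
# Names are illustrative, not taken from the paper.

@dataclass
class Span:
    index: int        # position of this step in the agent trace
    role: str         # e.g. "plan", "tool_call", "observation"
    content: str

@dataclass
class SpanVerdict:
    span_index: int             # which span is being graded
    passed: bool
    category: Optional[str]     # failure category, e.g. "wrong tool arguments"
    rationale: str              # explicit justification for the verdict

def judge_trace(spans: List[Span],
                judge: Callable[[Span, List[Span]], SpanVerdict]) -> List[SpanVerdict]:
    """Grade each span with its preceding context instead of asking for one
    trace-level label. In practice `judge` would wrap a frontier-model call
    behind a structured prompt."""
    verdicts, context = [], []
    for span in spans:
        verdicts.append(judge(span, context))
        context.append(span)
    return verdicts

def trace_failed(verdicts: List[SpanVerdict]) -> bool:
    # A trace-level pass/fail falls out of the span verdicts, but the
    # localization (which span) and the category survive alongside it.
    return any(not v.passed for v in verdicts)
```

The point of this shape is that trace-level pass/fail becomes a derived quantity, while the span index and failure category are first-class outputs.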

Numbers are loud. Up to 38% relative gain in category F1 over monolithic baselines. A 3.5x improvement in localization accuracy, meaning the framework correctly identifies the offending span 3.5 times more often than a frontier model used as a flat judge. A 12.5x improvement in joint localization-plus-categorization accuracy. Sets a new state of the art for evaluation on GAIA and SWE-Bench. The key finding the paper hammers home: the same frontier model achieves several times higher localization accuracy inside this framework than as a monolithic judge. Methodology, not model capability, was the limiting factor.
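For reference, the headline metrics are straightforward to compute once you have per-trace gold labels. A rough sketch, assuming a plain reading of "localization accuracy" and "joint accuracy" rather than the paper's exact protocol:

```python
from typing import List, Optional, Tuple

# One prediction per failing trace: (offending span index, failure category).
# Gold labels come from human annotation. These definitions are a plain
# reading of the metric names, not necessarily the paper's exact protocol.
Label = Tuple[int, Optional[str]]

def localization_accuracy(preds: List[Label], golds: List[Label]) -> float:
    """Fraction of failing traces where the judge points at the correct span."""
    hits = sum(p[0] == g[0] for p, g in zip(preds, golds))
    return hits / len(golds)

def joint_accuracy(preds: List[Label], golds: List[Label]) -> float:
    """Judge must get both the offending span and the failure category right."""
    hits = sum(p == g for p, g in zip(preds, golds))
    return hits / len(golds)

# The paper's 3.5x and 12.5x claims compare these two numbers for the same
# frontier model used inside the framework versus as a flat, monolithic judge.
```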

Why this matters: agent evaluation has been the slowest-moving piece of the agent stack. Everyone trains agents on GAIA or SWE-Bench, then evaluates with a frontier model as judge, then complains that the judge is unreliable. If this method generalizes, agent dev loops get a meaningfully sharper feedback signal without swapping models. It pairs structurally with last week's Judgment Labs funding and AgentRail launch: the agent-observability stack is maturing in parallel with agent capability.

Paper at arxiv.org/abs/2605.14865. No GitHub URL in the abstract.