A New Way to Grade Agents That Beats Frontier Models as Judges
Holistic Evaluation and Failure Diagnosis of AI Agents landed on arXiv May 14 (2605.14865). Fifteen authors. The paper takes apart the monolithic-LLM-as-judge approach to evaluating agent runs and proposes a framework that decomposes each trace into span-level assessments, with an explicit rationale attached to each verdict. The output is not just pass/fail but where and why the failure occurred.
Numbers are loud. Up to a 38% relative gain in category F1 over monolithic baselines. A 3.5x improvement in localization accuracy, meaning the framework correctly identifies the offending span 3.5 times more often than a frontier model used as a flat judge. A 12.5x improvement in joint localization-plus-categorization accuracy. State of the art on the GAIA and SWE-Bench evaluations. The key finding the paper hammers: the same frontier model achieves several times higher localization accuracy inside this framework than as a monolithic judge. Methodology, not model capability, was the limiting factor.
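For readers unfamiliar with the two accuracy metrics quoted above, here is a minimal sketch of how they are typically scored, assuming each trace has one gold failing span and one gold failure category. The field names and exact-match scoring rule are assumptions for illustration, not the paper's definitions:

```python
def localization_accuracy(preds, golds):
    """Fraction of traces where the predicted failing span matches gold."""
    hits = sum(p["span"] == g["span"] for p, g in zip(preds, golds))
    return hits / len(golds)

def joint_accuracy(preds, golds):
    """Stricter metric: both the failing span AND its category must match."""
    hits = sum(p["span"] == g["span"] and p["category"] == g["category"]
               for p, g in zip(preds, golds))
    return hits / len(golds)
```

Joint accuracy is bounded above by localization accuracy, which is why a 12.5x joint gain on top of a 3.5x localization gain implies the framework also categorizes correctly-located failures far more reliably.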
Why this matters: agent evaluation has been the slowest-moving piece of the agent stack. Everyone trains agents on GAIA or SWE-Bench, then evaluates with a frontier model as judge, then complains that the judge is unreliable. If this method generalizes, agent dev loops get a meaningfully sharper feedback signal without swapping models. It pairs structurally with last week's Judgment Labs funding and AgentRail launch as the agent-observability stack maturing in parallel with agent capability.
Paper at arxiv.org/abs/2605.14865. No GitHub URL in the abstract.