April 26, 2026 · Research · Benchmark · Agents

DIVERT — A Paper That Pairs With OpenAI's Benchmark Funeral

The same week OpenAI tells the world to stop trusting SWE-bench Verified, this paper from IBM Research lands on arXiv with a different angle on the same problem. If your benchmark is broken, the obvious alternative is to simulate users against your agent. The hard part is doing that without burning a fortune on tokens.

DIVERT, by Itay Nakash, George Kour and Ateret Anaby-Tavor at IBM, stands for Diversity-Induced Evaluation via Branching of Trajectories. Standard practice for evaluating customer-facing LLM agents is linear rollout: simulate a user, run a conversation end to end, see if the agent fails. Repeat. The problem is that nine out of ten conversations share the same opening, the same boilerplate, the same first three turns, and you pay for those turns every time. DIVERT instead runs snapshot-based, coverage-guided simulation: it branches at decision points, reuses the shared prefix across every conversation that diverges from the same point, and uses diversity as a signal for where to spend the simulation budget.
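To make the prefix-reuse idea concrete, here is a minimal sketch of the cost difference between linear rollout and branching from a snapshot. This is not the paper's implementation: `agent_reply`, `Snapshot`, the one-call-per-turn cost model, and the omission of the diversity-guided budget allocation are all simplifying assumptions for illustration.

```python
from dataclasses import dataclass

def agent_reply(history):
    """Hypothetical stand-in for one simulated LLM agent call."""
    return f"agent-response-to:{history[-1]}"

@dataclass
class Snapshot:
    history: list      # conversation state captured at a decision point
    tokens_spent: int  # cost already paid to reach this state

def linear_rollouts(openings, followups):
    """Baseline: replay the shared opening for every follow-up variation."""
    cost = 0
    for opening in openings:
        for follow in followups:
            history = []
            for turn in opening + [follow]:
                history += [turn, agent_reply(history + [turn])]
                cost += 1  # one simulated agent call per user turn
    return cost

def branched_rollouts(openings, followups):
    """DIVERT-style idea: pay for the shared prefix once, then branch."""
    cost = 0
    for opening in openings:
        history = []
        for turn in opening:
            history += [turn, agent_reply(history + [turn])]
            cost += 1
        snap = Snapshot(history=list(history), tokens_spent=cost)
        for follow in followups:
            # Branch from the snapshot: the prefix is reused, not re-simulated.
            branch = list(snap.history)
            branch += [follow, agent_reply(branch + [follow])]
            cost += 1
    return cost
```

With one two-turn opening and three follow-up branches, the linear baseline pays 9 simulated calls while branching pays 5, and the gap widens with more branches per decision point. The real system additionally has to decide *where* to branch, which is what the diversity signal is for.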

The punchline: more failures discovered per token compared to linear rollout, and a wider set of tasks where failures actually surface. Translation: you can find more bugs in your agent for less money. Also translation: most of the agent eval pipelines running in production right now are wasting compute on redundant prefixes.

The pairing with the SWE-bench story is the real insight. The current eval crisis has two faces: the static benchmark face, where you saturate and contaminate fixed test sets, and the dynamic eval face, where you pay too much for too little coverage. OpenAI's call kills the first. DIVERT cuts the second by maybe an order of magnitude. The next twelve months in agent eval are going to be about graph-structured user simulation, not bigger fixed leaderboards. Paper at arxiv.org/abs/2604.21480.
