Agents' Last Exam: Agents Average 2.6% on the Hard Stuff
On the same day Anthropic claimed state-of-the-art on nearly every benchmark, a Berkeley team led by Dawn Song dropped the benchmark that doesn't care. Agents' Last Exam takes 1,490 real professional tasks, built with 250+ industry experts and grounded in the U.S. federal occupational taxonomy, across 55 subfields and 13 industry clusters. These aren't toy puzzles. They're the actual economically valuable work people get paid for.
The scores are humbling. The strongest setup they tested, Codex with GPT-5.5, passes 26.2% overall. On the hardest Last-Exam tier, the average across every harness and model is 2.6%. Computational math and agriculture clear 60%, while visual media and education sit under 30%. And here's a detail that should bother anyone building agents: the models underuse GUIs even when the task demands clicking, defaulting to the command line because that's what they're comfortable with.
The framing is the point. The authors argue the gap between benchmark wins and real deployment isn't a capability problem but an evaluation problem: we keep scoring agents on things that don't look like work. ALE is built as a living benchmark, with a public submission portal so the task pool keeps growing as agents catch up.
Put this next to Fable 5 and you get the honest picture of June 2026: the models are genuinely SOTA, and even the strongest setup still fails roughly three out of four real tasks. Both things are true. Anyone telling you agents are about to replace knowledge workers wholesale hasn't looked at the 2.6%. Link: https://agents-last-exam.org/