OpenAI Just Killed SWE-bench Verified
OpenAI quietly buried the most-cited coding agent benchmark of the last two years.
In a post on April 27, OpenAI's own evals team said they have stopped reporting SWE-bench Verified scores, and that other model developers should stop too. Two reasons. First, 59.4% of audited problems have flawed test cases, meaning correct fixes get rejected and wrong fixes pass. Second, frontier models can reproduce the original human-written bug fixes verbatim, which means they were trained on the answer set. So a higher score now mostly tells you how much the model saw the test during training, not how well it codes.
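If you want to sanity-check your own eval runs for this kind of memorization, a crude but useful signal is how closely a model's patch matches the gold patch from the original commit. Here is a minimal sketch using only the standard library; the helper names and the 0.95 threshold are my own illustration, not OpenAI's methodology:

```python
import difflib

def normalize(patch: str) -> str:
    # Keep only added/removed code lines so diff headers don't inflate similarity.
    return "\n".join(
        ln.strip()
        for ln in patch.splitlines()
        if ln.startswith(("+", "-")) and not ln.startswith(("+++", "---"))
    )

def gold_similarity(model_patch: str, gold_patch: str) -> float:
    # A ratio of 1.0 means the model reproduced the human-written fix verbatim.
    # On a contaminated benchmark, that is memorization, not skill.
    return difflib.SequenceMatcher(
        None, normalize(model_patch), normalize(gold_patch)
    ).ratio()

# Toy example: flag any run whose patch is near-identical to the known answer.
gold = "--- a/app.py\n+++ b/app.py\n-    return x\n+    return x or default\n"
model = "--- a/app.py\n+++ b/app.py\n-    return x\n+    return x or default\n"
if gold_similarity(model, gold) > 0.95:
    print("near-verbatim match to the gold patch: likely memorized, not solved")
```

A fuzzy match like this will not catch semantically equivalent rewrites, but a cluster of near-1.0 scores across a benchmark is exactly the verbatim-reproduction pattern OpenAI describes.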
The pivot: OpenAI now recommends SWE-bench Pro from Scale, with 1,865 tasks across 41 actively maintained repos in Python, Go, TypeScript, and JavaScript, sourced from real commit histories that frontier models have not seen. The gap is real: Claude Opus 4.5 hits 80.9% on Verified but only 45.9% on Pro under the same scaffolding, a 35-point drop. Nearly half the apparent skill was leakage.
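To make that arithmetic explicit, using the two scores above:

```python
verified, pro = 0.809, 0.459  # Claude Opus 4.5, same scaffolding on both benchmarks
drop = verified - pro
print(f"absolute drop: {drop:.1%}")                    # 35.0%
print(f"share of the Verified score: {drop/verified:.0%}")  # ~43% evaporates on unseen tasks
```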
This is the kind of admission you do not get often from a frontier lab. It also lines up with Latent Space's interview with OpenAI's Mia Glaese and Olivia Watkins, bluntly titled "The End of SWE-Bench Verified." The whole industry was writing PRs to chase a number that was never measuring what it claimed to measure.
For anyone shipping coding agents, the takeaway is uncomfortable. If you optimized your fine-tunes, your RL setup, or your harness against Verified scores, you optimized against a saturated and contaminated target. The next leaderboard cycle is going to be brutal because Pro scores are roughly half of Verified scores for the same model, and there is no shortcut left. Read OpenAI's full writeup at openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/ and the Pro leaderboard at labs.scale.com/leaderboard/swe_bench_pro_public.