Opus 4.8 beats GPT-5.5 on coding, but the honesty upgrade is the bigger deal
Anthropic dropped Claude Opus 4.8 on May 28, and the headline number is real: 69.2 percent on SWE-Bench Pro, ahead of GPT-5.5 and Gemini 3.1 Pro, and it was the only model to finish every case end-to-end on the Super-Agent benchmark, at parity with GPT-5.5 on cost. Pricing starts at 5 dollars per million input tokens and 25 per million output, the fast mode runs 2.5x quicker, and Anthropic says it is three times cheaper to run than the prior generation. On paper it just took back the coding crown.
But the part worth slowing down for is the honesty gain. Testers found Opus 4.8 more likely to flag uncertainty about its own work and less likely to make claims it cannot support. That sounds soft next to a benchmark score. It is not. The failure mode that actually breaks autonomous agents is not being a few points dumber, it is being confidently wrong for thirty minutes while nobody is watching. A model that stops and says I am not sure about this is worth more to a long-running agent than one extra point on any leaderboard.
The other new knob is an effort control panel, where you decide how much compute Claude burns on a given response. That is a quiet admission that not every task deserves the full model, and that the person paying the bill should get to choose. It pairs naturally with the cheaper, faster inference.
The through-line of this release is autonomy you can leave running. Stronger agentic coding, better judgment on long tasks, and a model that surfaces its own doubt instead of bluffing. As we hand agents longer leashes, calibration starts to matter more than raw capability, and this is the first Opus that seems designed around that idea rather than around the benchmark table.
Link: anthropic.com/news/claude-opus-4-8
← Back to all articles
But the part worth slowing down for is the honesty gain. Testers found Opus 4.8 more likely to flag uncertainty about its own work and less likely to make claims it cannot support. That sounds soft next to a benchmark score. It is not. The failure mode that actually breaks autonomous agents is not being a few points dumber, it is being confidently wrong for thirty minutes while nobody is watching. A model that stops and says I am not sure about this is worth more to a long-running agent than one extra point on any leaderboard.
The other new knob is an effort control panel, where you decide how much compute Claude burns on a given response. That is a quiet admission that not every task deserves the full model, and that the person paying the bill should get to choose. It pairs naturally with the cheaper, faster inference.
The through-line of this release is autonomy you can leave running. Stronger agentic coding, better judgment on long tasks, and a model that surfaces its own doubt instead of bluffing. As we hand agents longer leashes, calibration starts to matter more than raw capability, and this is the first Opus that seems designed around that idea rather than around the benchmark table.
Link: anthropic.com/news/claude-opus-4-8
Comments