GPT-5.5 kills the benchmark
OpenAI shipped GPT-5.5 yesterday, and the numbers make everything before it feel like last year. Terminal-Bench 2.0: 82.7 percent, against 75.1 for GPT-5.4 and 69.4 for Claude Opus 4.7. Expert-SWE: 73.1. BrowseComp with GPT-5.5 Pro hits 90.1 versus Gemini 3.1 Pro at 85.9. The Artificial Analysis Intelligence Index puts GPT-5.5 at 60, three full points clear of Opus 4.7 and Gemini 3.1 Pro Preview, both sitting on 57.
The agent story is where it gets wild. GDPval, which tests models on real-world economically valuable work, puts GPT-5.5 almost 5 points clear of Opus 4.7. XBOW says in their pentesting eval the model’s vulnerability miss rate dropped from 40 percent on GPT-5 to 10 percent on GPT-5.5, and black-box GPT-5.5 without source code now beats GPT-5 with full code access. They literally wrote that GPT-5.5 killed their benchmark. That’s Mythos-level capability shipping to every Plus subscriber.
API context is 1M tokens, Codex gets 400K, and long-context retrieval jumps to 74 percent. Critically, per-token latency matches GPT-5.4, and the model uses fewer tokens to finish the same task, so end-to-end it's faster and cheaper. Sam Altman's framing is that GPT-5.5 brings OpenAI one step closer to a true AI super app, and for once the marketing line actually aligns with the numbers.
For anyone building agents, the calculus just shifted. Two months ago Claude Opus 4.7 was the default for hard agentic coding. Now Anthropic is the one writing postmortems about quality regressions, while OpenAI has a model topping every benchmark it ships against. The pendulum is back on OpenAI's side for the first time since late 2025. Expect a response from Anthropic before the end of May.
https://openai.com/index/introducing-gpt-5-5/