The Benchmark That Asks If Your Code Would Actually Get Merged
Cognition, the Devin people, fresh off a billion-dollar raise at a $26 billion valuation, just shipped FrontierCode, and it's the first coding benchmark that measures something other than whether the tests pass. It measures mergeability: would the maintainer of this repo actually accept this code? That means test quality, scope discipline, code style, and adherence to the project's own conventions. The stuff that separates a real engineer from a code-vending machine.
The numbers are humbling. More than 20 world-class open-source maintainers built tasks from repos they actually maintain, spending 40-plus hours per task. On the hardest tier, FrontierCode Diamond, the best model on earth, Claude Opus 4.8, scores 13.4%. GPT-5.5 gets 6.3%. Gemini 3.1 Pro gets 4.7%. The benchmark is nowhere near saturated, which is exactly the point.
Why it matters: every coding benchmark we've been celebrating, SWE-bench and its cousins, measures correctness, and models have been crushing those for a year. FrontierCode says correctness was the easy part. The hard part is writing code a senior engineer would sign off on without rewriting it. If you believe agents are going to do real software engineering, this is the gap that has to close, and right now it's a chasm.
It's also honest of Cognition to publish a benchmark that makes everyone look bad, including Claude Opus 4.8, the model their own Devin runs on. A 13.4% top score is the most useful thing a coding-agent company could tell you right now: we are not nearly done. Link: https://cognition.ai/blog/frontier-code