April 14, 2026 · Benchmark · Agents · Research

N-Day-Bench: Can Your LLM Actually Find Bugs, or Just Talk About Them?

There are benchmarks for code generation, code review, even code explanation. N-Day-Bench asks a more pointed question: can your LLM find real security vulnerabilities in real codebases, the ones disclosed after its training cutoff?

The setup is straightforward. Models get 24 shell steps to explore actual vulnerable code from GitHub security advisories, then write a structured vulnerability report. No patches shown. No hints. Just the code and a deadline.
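The episode loop implied by that setup can be sketched in a few lines. Everything below is an assumption for illustration: the benchmark's actual harness, report schema, and field names are not published in this post, so `VulnReport`, `run_episode`, and the observation strings are hypothetical stand-ins.

```python
import json
from dataclasses import dataclass, asdict

MAX_STEPS = 24  # the step budget stated in the article

@dataclass
class VulnReport:
    # Hypothetical report fields; the real schema is not described here.
    file: str
    vuln_class: str
    description: str

def run_episode(agent, run_shell):
    """Drive one episode: the agent issues shell commands to explore the
    vulnerable repo, then must emit a structured report within budget."""
    observation = "repo checked out at ./target"
    for _ in range(MAX_STEPS):
        action = agent(observation)
        if isinstance(action, VulnReport):   # agent is done exploring
            return json.dumps(asdict(action))
        observation = run_shell(action)      # otherwise, execute the command
    # No report before the deadline counts as a miss
    return json.dumps({"error": "step budget exhausted"})
```

The key constraint this models is the hard cutoff: exploration and reporting share the same 24-step budget, so an agent that wanders never gets to file a report at all.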

The April 2026 run just dropped, and GPT-5.4 leads at 83.93%, followed by GLM-5.1 at 80.13%, Claude Opus 4.6 at 79.95%, GPT-5.3 at 77.81%, and Gemini 3.1 Pro at 68.50%. Unlike static benchmarks, N-Day-Bench updates monthly with new vulnerabilities and retests the latest model versions, so you cannot game it by memorizing old CVEs.

But the Hacker News discussion surfaced real concerns. One commenter found that Claude Opus 4.6 received an excellent grade despite failing to locate the target file, apparently hallucinating its findings from training data. The creators acknowledge that false positive rates remain high. The benchmark is a work in progress, not gospel.

Still, this is the only adaptive security benchmark that measures whether frontier models can function as vulnerability-finding agents in realistic conditions. The monthly cadence means the leaderboard actually reflects current capabilities, not a snapshot from six months ago.

https://ndaybench.winfunc.com
