June 11, 2026 · Benchmark · Coding

Endor Labs Tested Fable 5 on Real Security Work: Mid-Table, with Cheating

Two days after Anthropic launched Claude Fable 5 claiming SOTA on nearly every benchmark, Endor Labs published the first big independent reality check, and it's on Hacker News today. They ran Fable 5 with Claude Code through their Agent Security League: 200 real-world vulnerability-fixing tasks where the agent must patch real code without breaking functionality. Result: 59.8% FuncPass, 19.0% SecPass. That is mid-table: not last, not first.

The detail that should travel furthest: Endor confirmed cheating on 38 of the 200 instances (19%), almost entirely driven by memorization, meaning the model reproduced upstream fixes it had seen in training data rather than reasoning its way to a patch. As frontier models swallow more of the internet, "solved" increasingly needs an asterisk, and benchmark makers now have to audit for recall masquerading as capability.
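What an audit for recall might look like, in its crudest form: compare the agent's patch against the upstream fix and flag near-verbatim matches for human review. The sketch below is an illustration only. This post does not describe how Endor detects memorization, and the function names (`normalize`, `similarity`, `flag_for_review`) and the 0.9 threshold are assumptions made up for the example.

```python
from difflib import SequenceMatcher


def normalize(patch: str) -> str:
    """Keep only added/removed lines of a unified diff, with whitespace trimmed."""
    kept = []
    for line in patch.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file headers carry no fix content
        if line.startswith(("+", "-")):
            kept.append(line[0] + line[1:].strip())
    return "\n".join(kept)


def similarity(agent_patch: str, upstream_patch: str) -> float:
    """Character-level similarity in [0, 1] between two normalized patches."""
    a, b = normalize(agent_patch), normalize(upstream_patch)
    if not a or not b:
        return 0.0  # nothing to compare; avoid treating two empty diffs as a match
    return SequenceMatcher(None, a, b).ratio()


def flag_for_review(agent_patch: str, upstream_patch: str, threshold: float = 0.9) -> bool:
    """Hypothetical threshold: a high score means 'look closer', not 'cheated'."""
    return similarity(agent_patch, upstream_patch) >= threshold


UPSTREAM = """--- a/app.py
+++ b/app.py
-    query = 'SELECT * FROM users WHERE id=' + uid
+    query = 'SELECT * FROM users WHERE id=%s'
+    cursor.execute(query, (uid,))
"""

# Same change as upstream, differing only in indentation.
NEAR_COPY = """--- a/app.py
+++ b/app.py
-query = 'SELECT * FROM users WHERE id=' + uid
+query = 'SELECT * FROM users WHERE id=%s'
+cursor.execute(query, (uid,))
"""

# A different fix for the same bug.
INDEPENDENT = """--- a/app.py
+++ b/app.py
+    if not uid.isdigit():
+        raise ValueError('bad id')
"""

if __name__ == "__main__":
    print(flag_for_review(NEAR_COPY, UPSTREAM))
    print(flag_for_review(INDEPENDENT, UPSTREAM))
```

A high score is only a reason to look closer. Many one-line security fixes have a single obvious form, so textual similarity alone cannot separate recall from sound reasoning, which is presumably why confirming cases takes more than a diff.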

Two more wrinkles. Fable 5's extended thinking produced more per-instance timeouts than any model-harness combination Endor has ever tested. More thinking is not free. And yet the same model solved four instances no model-and-agent combination had ever cracked. Mid-tier on average, superhuman at the edges, padded by memorization: that's the actual texture of frontier capability right now, and it's much messier than a launch-day chart.

This continues the week's theme. Agents' Last Exam showed top agents passing 26.2% of real economic tasks; now the shiniest new model lands mid-table on real security work. The gap between leaderboard SOTA and deployed competence is becoming the most important number in AI, and nobody prints it on launch day.

Report: https://www.endorlabs.com/learn/claude-fable-5-mythos-grade-hype