May 2, 2026 · Benchmark · Agents · Research

KellyBench — frontier models lose money betting on the Premier League

New benchmark, posted to arXiv on April 30. KellyBench drops frontier models into a simulated 2023-24 English Premier League season as betting agents. Goal: maximize bankroll growth using historical data, stats, lineups, and public odds. Outcome: every model lost money. The top performer averaged -8% returns, and several models blew through multiple bankrolls.
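
For context on the name: the Kelly criterion sizes each bet to maximize long-run log-bankroll growth, which is the objective the benchmark scores agents on. A minimal sketch in Python; the probability and odds below are illustrative, not from the paper.

```python
# Minimal sketch of Kelly-criterion stake sizing, the strategy the
# benchmark is named after. Inputs below are illustrative, not from
# the paper.

def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Fraction of bankroll to stake, per Kelly: f* = (b*p - q) / b,
    where b = decimal_odds - 1 (net winnings per unit) and q = 1 - p."""
    b = decimal_odds - 1.0
    q = 1.0 - p_win
    f = (b * p_win - q) / b
    return max(0.0, f)  # zero or negative edge -> don't bet

# Example: model estimates a 55% home win; market offers 2.10 decimal odds.
stake = kelly_fraction(0.55, 2.10)  # ~0.14, i.e. stake ~14% of bankroll
```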

Claude Opus 4.6 scored 26.5% on the human-expert rubric, which is to say its strategies looked unsophisticated next to a baseline human bettor. Most of what the models attempted was lookup-and-restate, not market-inefficiency hunting. None of them developed actual ML pipelines for predictions, despite being explicitly told they could.
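
For a sense of what the models left on the table, here is a hedged sketch of the simplest version of such a pipeline: fit a classifier on historical match features, compare its probabilities against the market's implied probabilities, and bet only where there is an edge. The features, edge threshold, and stand-in data are placeholders, not the paper's setup.

```python
# Hedged sketch of the kind of prediction pipeline the agents skipped:
# fit a classifier on historical matches, then compare model probability
# to the market's implied probability. Feature construction, the edge
# threshold, and the random stand-in data are all placeholders.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.random((500, 3))       # stand-in for real match features
y_train = rng.integers(0, 2, 500)    # stand-in outcomes: 1 = home win

model = LogisticRegression().fit(X_train, y_train)

def edge(features: np.ndarray, decimal_odds: float) -> float:
    """Model win probability minus the market's implied probability."""
    p_model = model.predict_proba(features.reshape(1, -1))[0, 1]
    p_market = 1.0 / decimal_odds  # ignores the bookmaker's overround
    return p_model - p_market

# Bet only when the model disagrees with the market by a clear margin,
# then size the stake with kelly_fraction() from the sketch above.
if edge(rng.random(3), 2.10) > 0.03:
    pass  # place the bet
```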

This is the third long-horizon-eval paper to land within two weeks (Synthetic Computers at Scale, Claw-Eval-Live, KellyBench). Each one finds the same wall: agents do fine on procedural tasks with a clear win condition; they collapse in long-horizon, non-stationary environments where the goal is open-ended and the world keeps changing underneath them. Sports betting is a clean instance of this: the league rolls forward, the odds shift, and a strategy that worked in week 1 may be wrong by week 20.

One angle worth noting: this is the first eval paper to put a financial-loss number on agent capability gaps. "-8% on the best frontier model" is the kind of single-line summary that makes its way into trader Slacks fast. Anyone running an LLM-based trading agent should read the methodology section before trusting a backtest that looks too good.

No public code repo yet; per the paper, the benchmark is accessible as an open API endpoint. arXiv: https://arxiv.org/abs/2604.27865