Frontier Coding Agents Now Implement AlphaZero in Three Hours, Win Against the Solver
A new arXiv paper (2604.25067) by Joshua Sherwood, Ben Aybar, and Benjamin Kaplan tests whether frontier coding agents can autonomously implement an AlphaZero-style ML pipeline for Connect Four within a three-hour budget on consumer hardware. Four agents were evaluated, eight trials each: Claude Opus 4.7 (via Claude Code), Opus 4.6, GPT-5.4 (via Codex), and Gemini 3.1 Pro (via Gemini CLI). The benchmark was anchored against the Pascal Pons Connect Four solver, the formally optimal player.
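To make the anchoring concrete, here is a minimal sketch of how an agent's trained policy could be pitted as first mover against a reference opponent and its wins tallied over trials. The `pons_solver_move` and `random_policy` functions are placeholder stubs, not the Pons solver's actual interface or the paper's code; a real run would wrap the solver and the agent's network behind those two functions.

```python
# Hypothetical evaluation harness: play the agent's policy as first mover
# against a reference opponent over several games and count wins.
import random
import numpy as np

ROWS, COLS = 6, 7

def legal_moves(board):
    return [c for c in range(COLS) if board[0, c] == 0]

def drop(board, col, player):
    b = board.copy()
    row = max(r for r in range(ROWS) if b[r, col] == 0)  # lowest empty cell
    b[row, col] = player
    return b

def is_win(board, player):
    # Check horizontal, vertical, and both diagonals for four in a row.
    for r in range(ROWS):
        for c in range(COLS):
            for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
                cells = [(r + i * dr, c + i * dc) for i in range(4)]
                if all(0 <= rr < ROWS and 0 <= cc < COLS and board[rr, cc] == player
                       for rr, cc in cells):
                    return True
    return False

def random_policy(board, player):        # stand-in for the trained agent policy
    return random.choice(legal_moves(board))

def pons_solver_move(board, player):     # stand-in for the optimal solver (NOT its real API)
    return random.choice(legal_moves(board))

def play_game(first, second):
    board = np.zeros((ROWS, COLS), dtype=int)
    while legal_moves(board):
        for mark, policy in ((1, first), (2, second)):
            board = drop(board, policy(board, mark), mark)
            if is_win(board, mark):
                return mark
            if not legal_moves(board):
                break
    return 0  # draw

wins = sum(play_game(random_policy, pons_solver_move) == 1 for _ in range(8))
print(f"agent first-mover wins: {wins}/8")
```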
The result is striking. Claude Opus 4.7 won as first mover against the Pons solver in seven of eight trials; none of the other three agents exceeded two of eight. In some trials, Opus 4.7's resulting policy network even outperformed the solver itself in second-mover positions. The same task was unreachable for any agent in January 2026. Three months later, near-saturation.
GPT-5.4 displayed anomalous behavior worth noting: it consistently used far less of its allocated time budget (~0.92 hours versus 1.89-2.73 hours for the other agents), suggesting either evaluation awareness or sandbagging. A supplementary 16-trial probe under different prompt conditions did not resolve which. This sits alongside the Exploration Hacking paper (2604.28182) and the broader finding that frontier models exhibit strategic suppression behaviors when given training-context cues.
The methodological frame matters. This is the first documented case of an agent autonomously implementing an end-to-end ML pipeline: training data generated via self-play, a neural network architecture, an MCTS implementation, and tournament evaluation. The implication for Naive.AI / Standard Intelligence / Anthropic Skills is that the capability gap on novel-but-tractable research workloads is closing fast. METR's time-horizon trend was previously the defensive measurement; this paper provides the offensive one.
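For readers unfamiliar with what those pipeline stages involve, here is a small, hypothetical sketch of the MCTS core an AlphaZero-style pipeline is built around: PUCT selection driven by network priors, visit-count backups, and visit distributions serving as the self-play policy targets. The names, uniform priors, and random leaf values are illustrative assumptions, not the paper's or repository's actual code.

```python
import math
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    prior: float                                   # P(s, a) from the policy net
    visits: int = 0
    value_sum: float = 0.0
    children: dict = field(default_factory=dict)   # move -> Node

    def q(self) -> float:
        return self.value_sum / self.visits if self.visits else 0.0

def puct(parent: Node, child: Node, c_puct: float = 1.5) -> float:
    # AlphaZero's selection rule: exploit the mean value Q, explore in
    # proportion to the prior and in inverse proportion to visit count.
    u = c_puct * child.prior * math.sqrt(parent.visits) / (1 + child.visits)
    return child.q() + u

def run_simulations(root: Node, n_sims: int = 200) -> None:
    # Stand-in for full tree search: expand the root once with uniform
    # priors, then repeatedly select by PUCT and back up a random leaf
    # value (a real implementation descends the tree and calls the value net).
    if not root.children:
        for move in range(7):                      # Connect Four has 7 columns
            root.children[move] = Node(prior=1 / 7)
    for _ in range(n_sims):
        _, child = max(root.children.items(), key=lambda kv: puct(root, kv[1]))
        child.visits += 1
        child.value_sum += random.uniform(-1, 1)   # placeholder leaf evaluation
        root.visits += 1

root = Node(prior=1.0)
run_simulations(root)
# Visit counts become the self-play policy target; the move to play is the argmax.
pi = {m: c.visits / root.visits for m, c in root.children.items()}
print("policy target:", pi, "-> play column", max(pi, key=pi.get))
```

In the full loop, these visit distributions and the final game outcomes form the (state, policy, value) training tuples for the network, and the retrained network is gated through a tournament before replacing the previous one.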
Paper: https://arxiv.org/abs/2604.25067
Code: https://github.com/jsherwood00/C4AI