SciCrafter Says Frontier Models Plateau at 26 Percent
There's a new agent benchmark called SciCrafter, and it's already producing the kind of result that should embarrass everyone selling autonomous discovery as a feature.
The setup is parameterized redstone circuit tasks in Minecraft: the agent has to light a set of redstone lamps in a specified pattern. Difficulty scales with the parameters in ways that punish memorization and reward genuine experimental discovery; you can't just look up the answer, because the parameters change from instance to instance. They tested GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5, and all three plateaued at around 26 percent success.
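To make the parameterization concrete, here's a minimal sketch of what a task instance could look like. This is not the SciCrafter API; the class, the field names, the parameter ranges, and the sample_task helper are all hypothetical, purely to illustrate why resampled parameters defeat lookup.

import random
from dataclasses import dataclass

# Hypothetical task spec -- names and ranges are illustrative,
# not taken from the SciCrafter codebase.
@dataclass(frozen=True)
class RedstoneTask:
    grid_size: int           # side length of the lamp grid
    target_pattern: tuple    # which lamps must end up lit
    lever_count: int         # inputs the agent can toggle
    hidden_wiring_seed: int  # fixes the lever-to-lamp circuit; not shown to the agent

def sample_task(rng: random.Random) -> RedstoneTask:
    """Draw a fresh task instance. Because the wiring seed and target
    pattern are resampled every time, a memorized solution to one
    instance tells the agent nothing about the next; it has to probe
    the circuit (flip levers, observe lamps) to recover the mapping."""
    n = rng.choice([3, 4, 5])
    lamps = n * n
    target = tuple(rng.random() < 0.5 for _ in range(lamps))
    return RedstoneTask(
        grid_size=n,
        target_pattern=target,
        lever_count=rng.randint(2, 6),
        hidden_wiring_seed=rng.randrange(2**32),
    )

def success(lit: tuple, task: RedstoneTask) -> bool:
    # Binary success criterion: the final lamp state must match exactly.
    return lit == task.target_pattern

The design point is the hidden wiring seed: if the lever-to-lamp mapping is regenerated per instance, the only route to the target pattern is experimentation, which is exactly the capability the benchmark is trying to isolate.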
What makes the paper land is the failure decomposition. The authors break it into four capacity gaps: knowledge gap identification, experimental discovery, knowledge consolidation, and knowledge application. The honest finding is that the bottleneck is shifting toward problem formulation. Frontier models can run the experiments and absorb the results; they struggle to figure out which question to ask. That's an interesting place to be stuck, because it's not the place the field has been investing.
The Minecraft framing is important. Pick any agent benchmark with high public traction (SWE-bench, WebArena, the rest) and the failure modes look like execution failures: wrong code, wrong action. SciCrafter is structured so the failure mode looks like discovery failure. That's a different shape of evaluation, and it slots in with the eval crisis cluster from the past two weeks (SWE-bench Verified contamination, DIVERT's over-spending finding, ClawMark's coworker-task numbers). The throughline: every benchmark that takes a different angle shows frontier models plateauing in a different way.
The team shipped code. The project page is at https://scicrafter-bench.github.io/, the code is at https://github.com/scicrafter-bench/scicraft-bench, and the paper is at https://arxiv.org/abs/2604.24697.