April 29, 2026 · Agents · Benchmark · Research

Frontier Agents Hit 9% on Scientific Literature Search

AutoResearchBench dropped on arXiv April 28 with the kind of scoreboard that should make every research-agent vendor uncomfortable. The benchmark has two tasks: Deep Research, where the agent must track down a specific target paper through multi-step investigation, and Wide Research, where it must gather every paper that meets a given set of criteria. The most powerful frontier LLMs score 9.39 percent accuracy on Deep Research and 9.31 percent IoU on Wide Research; most other strong baselines fall below 5 percent.
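It helps to be concrete about what those two numbers measure. Below is a minimal sketch of the two metrics as the paper describes them: exact-match accuracy for Deep Research and set IoU (Jaccard overlap) for Wide Research. The function names, and the assumption that papers are compared by identifier, are illustrative, not taken from the benchmark's released code.

```python
# Hedged sketch of AutoResearchBench's two headline metrics.
# Assumes papers are identified by string IDs; names are hypothetical.

def deep_research_accuracy(predictions: list[str], targets: list[str]) -> float:
    """Fraction of tasks where the agent returned the exact target paper."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)

def wide_research_iou(retrieved: set[str], gold: set[str]) -> float:
    """Jaccard overlap between the retrieved paper set and the gold set."""
    if not retrieved and not gold:
        return 1.0
    return len(retrieved & gold) / len(retrieved | gold)

# Example: finding 3 of 10 gold papers plus 2 spurious ones scores
# 3 / 12 ≈ 0.25 IoU, since the union counts both misses and extras.
print(wide_research_iou({"a", "b", "c", "x", "y"},
                        {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}))
```

Note that the IoU denominator punishes both missed papers and spurious extras, so a 9.31 percent average means the retrieved sets barely overlap the gold sets at all.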

For context, the same models score considerably higher on general web-browsing benchmarks like BrowseComp. Scientific literature is what breaks them: long papers, dense terminology, indirect references, and the need to actually understand what a method does and whether it relates to the question asked. The benchmark is explicitly research-oriented; it requires in-depth comprehension of scientific concepts, not just lookup-then-cite.

The editorial point is simple. Most autoresearch demos you see on Twitter run on tasks where the answer can be found by clicking through a few search results and reading abstracts. Real research starts after that. AutoResearchBench is a useful pin in the hype: any demo claiming agent-grade scientific literature work should report numbers on this benchmark before claiming the bottleneck is anywhere other than agent capability.

The benchmark is open source on GitHub. The 9 percent number is fresh. Watch which research-agent companies report it within the next quarter and which quietly avoid the question.

Links:
https://arxiv.org/abs/2604.25256
https://github.com/CherYou/AutoResearchBench
