QUEST trains a frontier research agent on 8,000 made-up tasks
Where does the training data for a deep research agent even come from? You can't scrape good multi-step research off the internet, because it doesn't exist as labeled data. QUEST, a new open release from a team that includes Ohio State's Yu Su and Huan Sun, says you don't need to scrape it. You can manufacture it.
QUEST is a family of open models, 2B to 35B, built to be general-purpose deep research agents, the kind that run long multi-step searches, verify facts, chase citations, and write a real report at the end. The trick is in the data. The team built a synthesis pipeline on what they call unified rubric trees: it generates research tasks that come with measurable rewards built in, with no human annotation anywhere. The models are then trained on that data in three stages: mid-training, fine-tuning, and RL.
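To make "measurable rewards built in" concrete, here is a minimal sketch of how a rubric tree could turn a written report into a single reward. It is not QUEST's implementation. The `RubricNode` class, the weighted-average aggregation, and the string-matching checks are all assumptions made for illustration; the article does not describe the actual tree structure or scoring rule.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RubricNode:
    """One node of a hypothetical rubric tree (structure assumed, not from the paper)."""
    name: str
    weight: float = 1.0
    check: Optional[Callable[[str], bool]] = None  # only used on leaves
    children: List["RubricNode"] = field(default_factory=list)

    def score(self, report: str) -> float:
        """Return a reward in [0, 1] for the given report."""
        if not self.children:
            # Leaf: a binary criterion that can be checked automatically.
            return 1.0 if self.check is not None and self.check(report) else 0.0
        # Internal node: weighted average of the child scores (an assumed rule).
        total = sum(child.weight for child in self.children)
        return sum(child.weight * child.score(report) for child in self.children) / total


# Toy rubric: factual coverage counts twice as much as citing a source.
rubric = RubricNode("task", children=[
    RubricNode("facts", weight=2.0, children=[
        RubricNode("mentions_2B", check=lambda r: "2B" in r),
        RubricNode("mentions_35B", check=lambda r: "35B" in r),
    ]),
    RubricNode("cites_source", weight=1.0, check=lambda r: "arxiv.org" in r),
])

reward = rubric.score("QUEST spans 2B to 35B models.")
```

In this toy case the report satisfies both factual leaves but cites nothing, so it earns partial credit. The point is only that a task shipped with a tree like this can be graded without a human in the loop, which is what makes it usable as an RL reward.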
The number that should stop you is 8,000 synthesized tasks. That's the whole budget. On eight deep-research benchmarks the models approach or beat proprietary frontier systems, and they're the strongest among open-weight competitors. Eight thousand fake tasks, fully open weights and code, matching closed labs that spend fortunes on human-labeled data.
This is the loop everyone has been circling. If an agent can generate its own training tasks with verifiable rewards, the data bottleneck that gated frontier capability starts to dissolve. The expensive part was never the compute; it was the labeled examples. QUEST is a clean demonstration that for research agents, you can just print them.
Paper: arxiv.org/abs/2605.24218