LongSeeker Beats Tongyi DeepResearch on BrowseComp by 18 Points
Same SJTU lab as OpenSeeker-v2 (Siheng Chen). May 6 arXiv submission. The thesis is unfashionable: stop accumulating everything in the context window. Instead, dynamically reshape context based on relevance.
They call it Context-ReAct. Five operations: Skip (drop irrelevant search), Compress (summarize resolved subtasks), Rollback (kill a dead branch), Snippet (preserve important quote), Delete (remove fully spent content). Fine-tuned from Qwen3-30B-A3B with 10,000 synthesized trajectories that demonstrate when to use which operation.
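The five operations amount to edits over an ordered list of context entries. A minimal sketch of that idea, with hypothetical names and data layout (the paper's actual interface may differ):

```python
# Illustrative sketch of Context-ReAct-style operations over an agent's
# context, modeled as an ordered list of entries. All names and the entry
# layout are assumptions for illustration, not the paper's interface.
from dataclasses import dataclass, field

@dataclass
class Entry:
    kind: str        # e.g. "search", "subtask", "snippet"
    text: str
    live: bool = True

@dataclass
class Context:
    entries: list = field(default_factory=list)

    def skip(self, i):                # Skip: drop an irrelevant search result
        self.entries[i].live = False

    def compress(self, i, summary):   # Compress: replace a resolved subtask
        self.entries[i] = Entry("summary", summary)

    def rollback(self, start):        # Rollback: kill a dead branch wholesale
        del self.entries[start:]

    def snippet(self, text):          # Snippet: preserve an important quote
        self.entries.append(Entry("snippet", text))

    def delete(self, i):              # Delete: remove fully spent content
        del self.entries[i]

    def render(self):                 # What the model sees at the next step
        return [e.text for e in self.entries if e.live]
```

The point of the sketch: the context the model conditions on next step is a function of these edits, not a monotonically growing log.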
Numbers: BrowseComp 61.5% (Tongyi DeepResearch 43.2%, AgentFold 36.2%). BrowseComp-ZH 62.5% (vs 46.7% / 47.3%). An 18-point gap on the English benchmark. The competitors are industrial-pipeline systems with CPT+SFT+RL. LongSeeker is SFT-only on a 30B base.
The structural read: when long-horizon agents fail, it is usually not because the model is too small or the tools are wrong; it is because the context window has filled up with junk that is confusing the next step. Context engineering as a first-class agent skill, learned during SFT, beats throwing more compute at the problem. Pairs cleanly with the Tool-Use Tax line of work (May 5) and AgentFloor (May 4): three independent papers in eight days arguing that the bottleneck has moved from "more capability" to "less noise."
Source: https://arxiv.org/abs/2605.05191