Xiaohongshu's HyperEyes Cuts Agent Tool Calls by 5.3x. Search Wider, Not Longer.
Xiaohongshu just put out HyperEyes on arXiv. 44 upvotes on HuggingFace Papers today, top of the agent-research feed. The pitch is one sentence — search wider, not longer. The premise is that multimodal search agents waste rounds by hitting one entity at a time, when the right unit of work is parallel search across many entities in a single turn.
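To make the serial-versus-parallel distinction concrete, here is a minimal asyncio sketch. It is not the paper's API: `search` is a hypothetical stand-in for whatever grounded retrieval tool the agent calls. The point is the shape of the turn: the serial agent pays one round of latency per entity, the wide agent pays one round total.

```python
import asyncio

# Hypothetical search tool, standing in for the agent's real
# retrieval backend. Not the HyperEyes API.
async def search(query: str) -> str:
    await asyncio.sleep(0.1)  # simulate one round of tool latency
    return f"results for {query!r}"

# Serial baseline: one entity per tool-call round.
# N entities cost N rounds of latency.
async def search_serial(entities: list[str]) -> list[str]:
    return [await search(e) for e in entities]

# "Search wider": fan out over all entities in a single turn.
# N entities cost one round of latency.
async def search_wide(entities: list[str]) -> list[str]:
    return await asyncio.gather(*(search(e) for e in entities))

if __name__ == "__main__":
    print(asyncio.run(search_wide(["entity_a", "entity_b", "entity_c"])))
```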
The numbers are concrete. HyperEyes-30B beats the strongest comparable open-source agent by 9.9% in accuracy, while using 5.3x fewer tool-call rounds on average. HyperEyes-235B hits 66.6% average accuracy, approaching Gemini-3.1-Pro. They also released IMEB — 300 human-curated instances built specifically to measure search efficiency, not just final accuracy. Code at github.com/Guankai-Li/HyperEyes.
The trick is RL training at two granularities. Macro level: TRACE, a tool-use reference-adaptive cost-efficiency reward that monotonically tightens the efficiency target during training. Micro level: on-policy distillation, where a teacher model injects dense token-level corrective signals into the agent's failed rollouts. On top of that sits a unified grounded search primitive that fuses visual grounding and retrieval into a single atomic action, so the agent can issue concurrent search queries instead of serial ones.
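The post doesn't spell out the TRACE formula, but the description, a cost-efficiency reward whose reference target monotonically tightens over training, pins down the general shape. Here is a minimal sketch under that reading; the penalty form and every constant are illustrative assumptions, not numbers from the paper.

```python
def trace_style_reward(correct: bool, rounds_used: int,
                       reference_rounds: float,
                       efficiency_weight: float = 0.5) -> float:
    """One plausible shape: full credit for a correct answer,
    minus a penalty for exceeding the current round budget."""
    accuracy_reward = 1.0 if correct else 0.0
    overshoot = max(0.0, rounds_used - reference_rounds)
    return accuracy_reward - efficiency_weight * overshoot / max(reference_rounds, 1.0)

def tightened_reference(step: int, start: float = 12.0,
                        floor: float = 3.0, decay: float = 0.999) -> float:
    """Monotonically shrink the round budget as training proceeds,
    pushing the policy toward fewer, wider tool calls."""
    return max(floor, start * decay ** step)

# Early in training a correct 10-round rollout goes unpenalized;
# late in training the same rollout eats a large efficiency penalty.
print(trace_style_reward(True, 10, tightened_reference(0)))     # 1.0 (budget is 12)
print(trace_style_reward(True, 10, tightened_reference(5000)))  # ~-0.17 (budget hit the floor of 3)
```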
Why this is structurally interesting: most agent-research papers chase accuracy and treat tool calls as free. HyperEyes makes efficiency a co-equal objective. An agent that hits 60% accuracy in three rounds can beat one that hits 65% in fifteen, once you price in tokens, latency, and tool budgets. The IMEB benchmark formalizes this, and other groups will now have to defend their tool-call-round numbers, not just their accuracy.
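To see the trade, price the rounds. A toy utility with a made-up per-round cost; nothing here is from IMEB's actual scoring:

```python
# Illustrative only: the cost weight below is invented, not IMEB's metric.
def utility(accuracy: float, rounds: float, cost_per_round: float = 0.01) -> float:
    return accuracy - cost_per_round * rounds

fast = utility(0.60, 3)    # 0.60 - 0.03 = 0.57
slow = utility(0.65, 15)   # 0.65 - 0.15 = 0.50
print(fast > slow)  # True at this price
```

The break-even point is 0.05 extra accuracy bought with 12 extra rounds, about 0.004 per round; price rounds below that and the slower agent wins. That accounting is exactly what a benchmark like IMEB forces into the open.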
The other interesting line: Xiaohongshu shipping agent research from the consumer-platform side. Not a research lab, not a coding-agent company. A Chinese content platform with 200M MAU is now publishing RL-trained multimodal agents with released code. Where agent research comes from is shifting: not just labs, but also the companies whose product surfaces are bleeding tokens. arxiv.org/abs/2605.07177.