April 11, 2026 · Benchmark · Agents · Research

KnowU-Bench: Finally, a Benchmark That Tests If Agents Know When to Shut Up

Most agent benchmarks test one thing: can you complete the task? But real-world mobile agents need a harder skill — knowing when NOT to act. KnowU-Bench from Zhejiang University is the first benchmark that evaluates whether agents can be proactive, personalized, and interactive in realistic Android environments.

The benchmark creates structured user personas with recurring habits and clean/noisy history logs. Then it tests four agent behaviors: act (do it now), ask (clarify first), wait (not the right time), and stay silent (this isn't your business). A truly useful mobile assistant shouldn't just follow orders — it should know your patterns well enough to anticipate needs, but also know when to back off.
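The four behaviors form a closed action space, so evaluation can reduce to checking the agent's decision against a gold label. Here is a minimal sketch of that idea; the class and function names are illustrative assumptions, not identifiers from the KnowU-Bench repo:

```python
from enum import Enum

# Hypothetical encoding of the four-way action space described above.
# Names are illustrative; the actual benchmark schema may differ.
class AgentDecision(Enum):
    ACT = "act"             # do it now
    ASK = "ask"             # clarify first
    WAIT = "wait"           # not the right time
    SILENT = "stay_silent"  # this isn't your business

def score_decision(predicted: AgentDecision, expected: AgentDecision) -> float:
    """Exact-match credit on the four-way decision."""
    return 1.0 if predicted == expected else 0.0
```

Treating "do nothing" as a first-class label is what separates this setup from completion-only benchmarks, where silence can never score points.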

Current models struggle badly with this. The benchmark includes baselines showing that even frontier models default to action when they should be staying quiet, or ask unnecessary questions when the context already contains the answer. The proactive tasks are particularly revealing: they require agents to decide whether to act based on the user's historical patterns, not just the current instruction.
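The over-acting failure mode described above can be captured with a simple rate: how often the agent acts when the gold label says to hold back. A hypothetical sketch, assuming a flat list of result dicts with `predicted`/`expected` string fields (not the benchmark's actual log format):

```python
def over_action_rate(results: list[dict]) -> float:
    """Fraction of 'hold back' cases (expected wait or stay_silent)
    where the agent chose to act anyway."""
    hold_cases = [r for r in results if r["expected"] in ("wait", "stay_silent")]
    if not hold_cases:
        return 0.0
    return sum(r["predicted"] == "act" for r in hold_cases) / len(hold_cases)
```

A high value on this metric is exactly the "defaults to action" bias the baselines exhibit.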

KnowU-Bench ships with evaluation runners, metric calculators, and a log viewer for debugging agent trajectories. If you're building mobile agents that need to coexist with humans rather than just execute commands, this is the benchmark that will tell you how far you still have to go.

Code: https://github.com/ZJU-REAL/KnowU-Bench
