April 9, 2026 | Research, Benchmark, Agents

PTE: A New Metric Proves More Tool Calls Make Agents Worse

Everyone building agents assumes more tool calls mean better results. A new paper from USTC finds the opposite: trajectories with higher tool-use costs tend to have lower reasoning correctness. Simply using more tools does not improve answer quality.

The paper introduces PTE (Prefill Token Equivalents), a hardware-aware efficiency metric for tool-integrated reasoning. The insight is that existing metrics like token counts and tool call counts miss what actually makes agents slow. When an agent calls an external tool, it creates a pause that evicts the KV-Cache, forcing the context to be recomputed when the agent resumes. The unfiltered response from the tool then inflates the cache, making every subsequent decode step slower. PTE captures all of this (internal reasoning cost, external tool cost, and the cache eviction penalty) in one number.
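To make the cost structure concrete, here is a minimal toy model in Python. It is not the paper's PTE formula, which this article does not reproduce. The `Step` schema, the `pte_like_cost` function, and the `decode_weight` parameter are all hypothetical names chosen for illustration. The sketch assumes that an evicted cache costs a full re-prefill of the current context, and that each decoded token costs a small amount proportional to the context length.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One segment of an agent trajectory (hypothetical schema)."""
    prefill_tokens: int   # new tokens fed to the model, e.g. a tool response
    decode_tokens: int    # tokens the model generates in this step
    cache_evicted: bool   # True if the KV cache was lost before this step


def pte_like_cost(steps, decode_weight=0.001):
    """Toy cost in prefill-token units (an assumption, not the paper's metric).

    - Each new prefill token costs 1.
    - If the cache was evicted, the whole prior context is prefilled again.
    - Each decoded token costs decode_weight * current context length,
      so a bloated context makes every later decode step more expensive.
    """
    total = 0.0
    context = 0  # tokens currently in the context window
    for step in steps:
        if step.cache_evicted:
            total += context              # recompute everything seen so far
        total += step.prefill_tokens
        context += step.prefill_tokens
        for _ in range(step.decode_tokens):
            total += decode_weight * context
            context += 1
    return total


# Same trajectory, once with a filtered tool response and once unfiltered.
concise = [Step(200, 50, False), Step(100, 50, True)]
verbose = [Step(200, 50, False), Step(4000, 50, True)]
print(pte_like_cost(concise), pte_like_cost(verbose))
```

Even in this simplified model, the tool call is charged twice: once for re-prefilling the evicted context and again through slower decoding over the longer context. A plain token count would not show either effect.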

Validation against wall-clock latency in high-concurrency production settings shows that PTE tracks latency significantly better than standard token counts do. The authors identify four distinct inefficiency patterns across five tool-integrated reasoning benchmarks: redundant tool calls, overly verbose tool outputs, unnecessary reasoning loops, and premature tool invocation.
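As an illustration of the first pattern, a trajectory log can be checked for repeated calls. The sketch below is a simple heuristic, not the authors' detection method: it assumes a trajectory is available as a list of `(tool_name, query)` tuples and treats a call as redundant when the same tool has already received the same query, ignoring case and extra whitespace. The function name `redundant_calls` is hypothetical.

```python
from collections import Counter


def redundant_calls(calls):
    """Return calls that repeat an earlier (tool, query) pair.

    `calls` is a list of (tool_name, query) tuples (assumed log format).
    Queries are compared case-insensitively with whitespace collapsed.
    """
    seen = Counter()
    redundant = []
    for tool, query in calls:
        key = (tool, " ".join(query.lower().split()))
        if seen[key]:
            redundant.append((tool, query))
        seen[key] += 1
    return redundant


trajectory = [
    ("search", "PTE metric"),
    ("search", "pte  metric"),   # same query, different casing and spacing
    ("calculator", "1 + 1"),
]
print(redundant_calls(trajectory))
```

Exact-match detection only catches the most obvious repeats. Paraphrased queries would need a similarity measure, which is beyond the scope of this sketch.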

The counterintuitive finding is the important one. The assumption that agents should use all available tools aggressively is wrong. The best agent trajectories are the ones that call tools surgically: the right tool, at the right time, with the right query. The code is open-sourced.

https://github.com/sqs-ustc/tool-reasoning-framework-PTE
