Every Tool Call Has a Tax. New Paper Quantifies It.
arXiv 2605.00136, "Are Tools All We Need? Unveiling the Tool-Use Tax in LLM Agents," dropped this week from Kaituo Zhang and collaborators. It quantifies the part most agent product pitches gloss over: every tool call eats latency, eats tokens, and eats accuracy whenever the LLM has to interpret a tool response that doesn't fit the schema it expected.
The methodology is straightforward. They ran agent benchmarks with and without tool access for the same set of tasks. The deltas are the tax — what you actually pay to add a tool to your agent's belt. The numbers are not small. On a meaningful share of tasks the agent does worse with the tool than without, because the tool response derails the chain of thought.
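The with/without comparison is easy to picture in code. A minimal sketch of the idea, with illustrative scores that are not from the paper (the function name and numbers are my own, hypothetical):

```python
from statistics import mean

def tool_use_tax(no_tool_scores, with_tool_scores):
    """Per-task accuracy deltas (no-tool minus with-tool).

    A positive delta means the agent did better WITHOUT the tool,
    i.e. the tool imposed a tax on that task.
    """
    deltas = [a - b for a, b in zip(no_tool_scores, with_tool_scores)]
    regressions = sum(d > 0 for d in deltas)  # tasks where the tool hurt
    return mean(deltas), regressions

# Illustrative per-task accuracies for the same four tasks, two conditions.
avg_tax, n_worse = tool_use_tax(
    [0.82, 0.74, 0.91, 0.66],  # no-tool baseline
    [0.79, 0.77, 0.84, 0.58],  # with tool access
)
# avg_tax is the mean tax; n_worse counts tasks where the tool regressed.
```

The same shape works for latency or token counts; only the sign convention changes.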
This pairs cleanly with AgentFloor (arXiv 2605.00334) from a few days ago, which showed small open-weight models match GPT-5 on most short-horizon tool-use tiers. Together you get a sharp two-step argument. Step one: tools cost more than you think. Step two: when tools work, small models are good enough. So the wrap-a-frontier-model-with-N-tools startup playbook is structurally questionable — frontier capability gets you long-horizon planning, not short-horizon tool dispatch.
The implication for builders is that adding tools isn't free architecture work. Every tool you add needs to clear an accuracy and latency bar against the no-tool baseline, and this paper is the receipt that the bar is non-trivial.
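That bar can be made explicit as a gating check before a tool ships. A sketch under my own assumptions; the thresholds, function name, and numbers are hypothetical, not from the paper:

```python
def tool_clears_bar(baseline_acc, tool_acc,
                    baseline_latency_ms, tool_latency_ms,
                    min_acc_gain=0.02, max_latency_ratio=2.0):
    """Admit a tool only if it beats the no-tool baseline.

    Requires a minimum accuracy gain AND keeps latency within a
    multiple of the no-tool run. Both thresholds are assumptions
    you would tune per product, not values from the paper.
    """
    accuracy_ok = (tool_acc - baseline_acc) >= min_acc_gain
    latency_ok = tool_latency_ms <= max_latency_ratio * baseline_latency_ms
    return accuracy_ok and latency_ok

# +5 points of accuracy at under 2x latency: clears the bar.
tool_clears_bar(0.70, 0.75, 800, 1400)
# +1 point at 3x latency: fails on both counts.
tool_clears_bar(0.70, 0.71, 800, 2500)
```

The point of writing it down is that "the tool seems useful" never appears as an input; only measured deltas against the no-tool baseline do.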
https://arxiv.org/abs/2605.00136