AgentFloor: small models do 80% of agent work
AgentFloor dropped on arXiv on May 1. Karmakar and Chatterjee. 16 open-weight models from 0.27B to 32B parameters, plus GPT-5 as a frontier baseline. 16,542 scored runs across a 30-task benchmark organized as a six-tier capability ladder. Instruction following, structured tool use, multi-step coordination, long-horizon planning. The headline finding is the kind that quietly resets pricing assumptions. Small and mid-sized open-weight models are already sufficient for the bulk of short-horizon, structured tool-use work. The strongest open-weight model matched GPT-5 on most tiers while being substantially cheaper and faster.
Where the gap shows up is exactly where you'd expect. Long-horizon planning that requires sustained coordination across many tool calls. Frontier models still pull ahead. Everywhere else the gap is small enough that your decision should be cost-driven, not capability-driven. If your agent's job is "look up the customer's last order, format an apology email, send it" you do not need GPT-5. If your agent's job is "diagnose this bug, plan a multi-file refactor, execute it across thirty minutes of tool calls" you probably do.
This is the right complement to a story that landed earlier today. Frontier Coding Agents Implement AlphaZero (Sherwood et al., Apr 27) showed Claude Opus 4.7 was the only model that could execute end-to-end multi-day agent research at non-trivial scale. Frontier-or-bust on the hard end. AgentFloor draws the line on the other side. Most agent work lives at the short-horizon end, and the small-model regime is already there. The takeaway for builders is bimodal infrastructure. Small open-weight models for the 80%, frontier API for the 20% that actually needs it.
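The bimodal setup can be sketched as a simple router. This is a hypothetical illustration, not from the paper: the model names, the tool-call-horizon threshold, and the `AgentTask` fields are all assumptions standing in for whatever signal your system actually has about task difficulty.

```python
from dataclasses import dataclass

# Placeholders, not real model identifiers.
SMALL_MODEL = "open-weight-14b"   # cheap local open-weight model
FRONTIER_MODEL = "frontier-api"   # expensive frontier endpoint

@dataclass
class AgentTask:
    description: str
    expected_tool_calls: int            # rough estimate of the tool-call horizon
    needs_multi_file_planning: bool = False

def route(task: AgentTask, horizon_threshold: int = 10) -> str:
    """Pick a model tier: frontier only when the task is long-horizon."""
    if task.needs_multi_file_planning or task.expected_tool_calls > horizon_threshold:
        return FRONTIER_MODEL
    return SMALL_MODEL

# Short-horizon, structured tool use: small model is enough.
apology = AgentTask("look up last order, send apology email", expected_tool_calls=3)
# Long-horizon planning with sustained coordination: frontier.
refactor = AgentTask("plan and execute multi-file refactor",
                     expected_tool_calls=40, needs_multi_file_planning=True)

print(route(apology))   # open-weight-14b
print(route(refactor))  # frontier-api
```

In practice the threshold would come from your own traces, not a constant, but the shape of the decision is the point: route by horizon, pay for the frontier only where it earns its keep.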
The full benchmark, harness, sweep configs, and run corpus are released. Useful for anyone shipping production agents trying to figure out which calls really need the expensive model.
Paper: https://arxiv.org/abs/2605.00334.