May 5, 2026ResearchAgentsBenchmark

Reflex Benchmarks Computer Use at 45x More Expensive

Reflex published a head-to-head benchmark today and HN ran it to #2 with 228 points. Same admin-panel task, two ways. Vision agent (browser-use stack) — 53 steps, 551k input tokens, 17 minutes. API agent (Sonnet) — 8 calls, 12k input tokens, 20 seconds. API agent (Haiku) — 8 calls, 9.5k tokens, 8 seconds. The 45x cost ratio is the headline number. Wall-clock is 50x. Token count is 50x.

Yes, Reflex is a vendor blog and they sell auto-API generation. Discount the conclusion accordingly. But the input-token count and step count are physical measurements — Anthropic and OpenAI bill on those, you can't argue them away. The line that lands: an agent that must see in order to act will always pay for the seeing. Vision is not free, every screenshot is a frame in the input window, and the input window is what you pay for.

This pairs straight into the Tool-Use Tax paper from last week (arXiv 2605.00136) and AgentFloor (2605.00334). Both are independent academic confirmations that tool overhead is real and asymmetric. Tool-Use Tax quantified that on a meaningful share of tasks the agent does worse with the tool than without. Reflex now puts a number on how much worse the worst tool — full computer-use vision — actually costs. Three data points in one week. The "wrap a frontier model with vision and call it an agent" startup playbook just got harder.

The structural read for me. There are two real architectures: structured-API agents (cheap, deterministic, narrow) and computer-use agents (expensive, flexible, broad). Both will exist. The question is which one wins where. For internal tools you control, Reflex is right — auto-generated API beats vision every time. For external tools you can't change, computer-use is the only option. Standard Intelligence ($75M Series A, May 3) is betting the latter category dominates because most of the world's apps will never expose a clean API. Reflex bets the former for everything else. Both will probably be right.

Blog post: reflex.dev/blog/computer-use-is-45x-more-expensive-than-structured-apis
← Previous
GLM-5V-Turbo Bets Multimodal Is the Default
Next →
T²PO Stops Multi-Turn Agents from Locking In
← Back to all articles

Comments

Loading...
>_