Benchspan: Agent Benchmarks in Minutes, Not Hours
Running SWE-bench on your coding agent takes 14 hours. Nobody does it as often as they should. That is the gap Benchspan fills.
Benchspan (benchspan.com, YC-backed) is an agent benchmarking platform that runs every instance in its own isolated Docker container, in parallel. That 14-hour SWE-bench run? Minutes. You write a bash script that starts your agent, point Benchspan at it, and that is the only integration work. No framework lock-in. No interface to conform to.
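To make that concrete, here is a minimal sketch of what such an entrypoint script could look like. It assumes, purely for illustration, that the harness hands your script a working directory and a task description; the actual arguments Benchspan passes, and the `my-coding-agent` binary with its flags, are placeholders, not its documented contract.

```bash
#!/usr/bin/env bash
# Hypothetical entrypoint sketch. The arguments assumed here, and
# "my-coding-agent", are illustrative placeholders, not Benchspan's
# documented contract.
set -euo pipefail

task_dir="$1"      # assumed: checkout of the instance's repository
task_file="$2"     # assumed: the issue/task description for this instance

# Start your agent however you normally would; this script is the only
# integration surface.
my-coding-agent \
  --workdir "$task_dir" \
  --task "$(cat "$task_file")"
```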
The practical workflow: pick from their benchmark library (SWE-bench Verified, SWE-bench Lite, Terminal-Bench, HumanEval, MBPP, MATH, GPQA) or bring your own. Set how many instances, hit run. Every result (scores, trajectories, token usage, latency, custom metrics) goes to one searchable dashboard your whole team can see. Runs are tagged by commit hash for reproducibility.
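As a rough picture of that loop, here is a hypothetical invocation. The `benchspan` command name and every flag on it are placeholders rather than the product's real CLI; `./run_agent.sh` is the entrypoint script sketched above, and the `git rev-parse` line shows one way a run could be tied to the exact agent version under test.

```bash
# Illustrative only: "benchspan", its subcommand, and these flags are
# placeholders, not the product's documented CLI.
commit=$(git rev-parse --short HEAD)   # tag the run with the agent version under test

benchspan run \
  --benchmark swe-bench-verified \
  --instances 100 \
  --agent ./run_agent.sh \
  --tag "$commit"
```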
The smart feature is selective reruns. Failed an instance? Rerun just that one instead of burning through the entire benchmark again. This alone probably saves thousands in compute costs per month for teams iterating on agent quality.
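To see how that savings claim could pencil out, here is a back-of-envelope comparison. Every number in it (instance count, failure count, per-instance cost, iteration rate) is an illustrative assumption, not Benchspan pricing or measured data.

```bash
#!/usr/bin/env bash
# Back-of-envelope only: all figures below are illustrative assumptions,
# not Benchspan pricing or measured costs.
total_instances=500        # e.g. all of SWE-bench Verified
failed_instances=20        # instances you actually want to retry
cost_per_instance=0.50     # hypothetical compute cost per instance, in dollars
reruns_per_month=40        # iterations in a month of active development

full=$(echo "$total_instances * $cost_per_instance * $reruns_per_month" | bc)
selective=$(echo "$failed_instances * $cost_per_instance * $reruns_per_month" | bc)

echo "Full-suite reruns: \$$full per month"
echo "Selective reruns:  \$$selective per month"
```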
Founded by Avi Arora and Ritesh Malpani in San Francisco. The product is launching on Product Hunt today.
Agent evaluation is the unsexy infrastructure that determines whether coding agents actually improve or just appear to. When your benchmark cycle is 14 hours, you run it once a week. When it is minutes, you run it on every commit. That changes the development velocity of the entire agent ecosystem.