May 14, 2026 · Agents · Benchmark · Research

EVA-Bench Builds the First Serious End-to-End Voice Agent Benchmark

The EVA-Bench paper (arXiv 2605.13841, 114 upvotes on Hugging Face) is the voice agent category's first real end-to-end evaluation framework. The multi-author team includes Tara Bogavelli, Gabrielle Gauthier Melançon, and Hari Subramani; the byline pattern suggests Mila / Cohere lineage.

The innovation is orchestrating bot-to-bot audio conversations over dynamic multi-turn dialogues. Instead of testing voice agents on canned turns and ASR accuracy, EVA-Bench has them talk to simulated callers under controlled scenarios and grades the result on two axes: EVA-A for task accuracy, EVA-X for experience quality. The benchmark covers 213 enterprise scenarios, plus robustness tests across accents and noise conditions.
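To make the setup concrete, here is a toy sketch of what bot-to-bot evaluation looks like: a voice agent and a simulated caller exchange turns until the caller's goal is met, then the episode is graded on two axes. Every class name and scoring rule below is an illustrative assumption, not the released framework's API.

```python
# Hypothetical sketch of an EVA-Bench-style episode: agent vs. simulated caller,
# graded on task accuracy (EVA-A) and an experience proxy (EVA-X).

class SimulatedCaller:
    """Caller with a fixed goal; wraps up once the agent addresses it."""
    def __init__(self, goal: str):
        self.goal = goal
        self.done = False

    def open(self) -> str:
        return f"Hi, I need to {self.goal}."

    def respond(self, agent_reply: str) -> str:
        if self.goal in agent_reply:       # crude stand-in for goal tracking
            self.done = True
            return "Great, thanks. Bye."
        return f"No, I said I need to {self.goal}."

def run_episode(agent_respond, caller: SimulatedCaller, max_turns: int = 10):
    """Drive a dynamic multi-turn dialogue, then grade it on two axes."""
    transcript = []
    utterance = caller.open()
    for _ in range(max_turns):
        reply = agent_respond(utterance)
        transcript.append((utterance, reply))
        utterance = caller.respond(reply)
        if caller.done:
            break
    eva_a = 1.0 if caller.done else 0.0            # did the task get done?
    eva_x = max(0.0, 1.0 - 0.1 * len(transcript))  # shorter call ≈ smoother here
    return eva_a, eva_x, transcript
```

In the real framework both sides are audio and the graders are far richer; the point of the sketch is only the shape of the loop, and that the two scores are computed over the same dynamic episode rather than over scripted turns.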

The punchline number is brutal: no system tested achieves above 0.5 on both metrics simultaneously, and the median gap between peak and reliable performance is 0.44. Accent and noise perturbations expose architecture-specific weaknesses that don't show up in scripted-turn tests. Voice agents in 2026 are good at single happy-path conversations and unreliable everywhere else.
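A minimal sketch of what those two headline numbers mean, computed from per-scenario scores. The function names, the joint-threshold check, and the "reliable = low-percentile run" definition are assumptions for illustration, not the paper's published methodology.

```python
from statistics import median  # median is how the paper reports the gap

def passes_both(eva_a: float, eva_x: float, threshold: float = 0.5) -> bool:
    """A system clears the bar only if it exceeds the threshold on task
    accuracy (EVA-A) AND experience quality (EVA-X) at the same time."""
    return eva_a > threshold and eva_x > threshold

def peak_reliable_gap(scores: list[float], reliable_pct: float = 0.1) -> float:
    """Gap between a system's best run and a low-percentile 'reliable' run.
    Assumes 'reliable performance' means roughly the 10th-percentile score."""
    ordered = sorted(scores)
    k = int(len(ordered) * reliable_pct)
    return max(ordered) - ordered[k]
```

The joint check is what makes the 0.5 result damning: a system can look strong on accuracy alone or experience alone and still fail it, which is exactly the pattern vendor demos hide.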

Why it matters: Vapi just raised a $50M Series B at $500M post on the back of running 100% of Amazon Ring's inbound voice. ElevenLabs is at $11B. Voice agent funding has been the hottest enterprise AI sub-sector for two quarters. EVA-Bench is the first benchmark that lets buyers shop on something other than vendor demos. Expect it to become for voice what SWE-Bench became for coding agents: the reference. The full framework, test suite, and benchmark data are being released open source; the paper is arXiv 2605.13841.