EVA-Bench: End-to-End Framework for Evaluating Voice Agents

ServiceNow AI


EVA-Bench evaluates voice agents end to end by simulating bot-to-bot audio conversations. It introduces two composite metrics, EVA-A (task completion and speech fidelity) and EVA-X (conversation flow and turn-taking timing), plus a 213-scenario benchmark spanning three enterprise domains. An evaluation of 12 systems finds that no single system excels on both metrics, with a median gap of 0.44 between peak and reliable performance.
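
To make the "gap between peak and reliable performance" concrete, here is a minimal Python sketch. The summary does not say how EVA-Bench defines either quantity, so this sketch assumes that "peak" is a system's best score across repeated trials and "reliable" is its worst. The function names and the example scores are hypothetical and are not taken from the paper.

```python
from statistics import median


def peak_reliable_gap(trial_scores):
    """Return (peak, reliable, gap) for one system's repeated-trial scores.

    Assumption for illustration only: peak = best trial, reliable = worst trial.
    """
    if not trial_scores:
        raise ValueError("need at least one trial score")
    peak = max(trial_scores)
    reliable = min(trial_scores)
    return peak, reliable, peak - reliable


def median_gap(scores_by_system):
    """Median of the per-system peak-minus-reliable gaps."""
    return median(peak_reliable_gap(s)[2] for s in scores_by_system.values())


# Hypothetical scores in [0, 1]; not results from EVA-Bench.
example_scores = {
    "system_a": [0.9, 0.5, 0.7],
    "system_b": [0.8, 0.6, 0.6],
    "system_c": [0.5, 0.5, 0.25],
}

if __name__ == "__main__":
    print(f"median gap: {median_gap(example_scores):.2f}")
```

A large median gap under this reading means that a typical system's best run is far better than what it delivers consistently, which is the reliability concern the summary raises.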

Why it matters

Voice agents are moving into enterprise production, but rigorous end-to-end evaluation has been lacking. EVA-Bench sets out a methodology for it and exposes sobering reliability gaps. The paper received 116 upvotes on Hugging Face Daily Papers (May 14).

Importance: 4/5

116 HF Daily upvotes (+1 bump); first rigorous end-to-end voice agent evaluation framework

Sources

official arXiv: EVA-Bench