#evaluation
- EVA-Bench: End-to-End Framework for Evaluating Voice Agents (ServiceNow AI, research)
- SOOHAK: Frontier LLMs Solve Hard Math But Fail to Recognize Unsolvable Problems (research)
- OpenAI Publishes Deployment Simulation: Predicting Model Behavior Before Release (OpenAI, research)
- Judge Circuits: Mechanistic Explanation of LLM-as-Judge Format Inconsistency (research)
- EvoArena: LLM Agents Score Only 40% on Dynamic Evolving Environments (MIT / NUS / Salesforce, research)
- WeaveBench: Computer-Use Agents Fail at Hybrid GUI+CLI Tasks — 41% Pass Rate (Microsoft Research, research)
- AutoResearchBench — a benchmark for autonomous scientific literature search by AI agents (BAAI, research)
- EvoArena: LLM Agents Score Only 39.6% on Dynamic Evolving Environments Benchmark (MIT, research)
- Multimodal Evaluator Preference Collapse: Cross-Modal Contagion in Self-Evolving Agent Loops (research)