evaluation — AI Digest

15 May EVA-Bench: End-to-End Framework for Evaluating Voice Agents ServiceNow AI research
18 May SOOHAK: Frontier LLMs Solve Hard Math But Fail to Recognize Unsolvable Problems research
19 Jun OpenAI Publishes Deployment Simulation: Predicting Model Behavior Before Release OpenAI research
18 May Judge Circuits: Mechanistic Explanation of LLM-as-Judge Format Inconsistency research
14 Jun EvoArena: LLM Agents Score Only 40% on Dynamic Evolving Environments MIT / NUS / Salesforce research
14 Jun WeaveBench: Computer-Use Agents Fail at Hybrid GUI+CLI Tasks - 41% Pass Rate Microsoft Research research
1 May AutoResearchBench: A Benchmark for Autonomous Scientific Literature Search by AI Agents BAAI research
12 Jun EvoArena: LLM Agents Score Only 39.6% on Dynamic Evolving Environments Benchmark MIT research
19 Jun Multimodal Evaluator Preference Collapse: Cross-Modal Contagion in Self-Evolving Agent Loops research