AutoResearchBench: a benchmark for autonomous scientific literature search by AI agents
BAAI
A new benchmark has been published for evaluating agents on autonomous scientific literature search and review. It includes two complementary setups: Deep Research (a multi-step investigation that must end at one specific target paper) and Wide Research (exhaustive collection of all publications matching given criteria, scored by intersection over union, IoU, between the collected set and the reference set). Even the strongest LLM agents reach only 9.39% accuracy on Deep Research and 9.31% IoU on Wide Research.
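For readers unfamiliar with the metric, here is a minimal sketch of IoU over two sets of papers. It shows only the generic set-based definition; how the benchmark itself matches papers and aggregates scores is not described here, and the function name, the identifiers, and the empty-set convention are all assumptions made for illustration.

```python
def iou(predicted: set[str], relevant: set[str]) -> float:
    """Intersection over union of two sets of paper identifiers."""
    union = predicted | relevant
    if not union:
        # Assumed convention: two empty sets score 0.0.
        return 0.0
    return len(predicted & relevant) / len(union)


# Hypothetical identifiers, for illustration only.
predicted = {"paper-a", "paper-b", "paper-c"}
relevant = {"paper-b", "paper-c", "paper-d", "paper-e"}

# 2 shared papers out of 5 distinct papers overall.
score = iou(predicted, relevant)
print(f"{score:.2f}")  # 0.40
```

The metric penalizes both missed papers and irrelevant ones, so an agent cannot raise its score simply by returning a longer list.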
Why it matters
Closes a methodological gap between general-purpose web agents and the actual work of a researcher; the ~9% figures set a low baseline against which progress on research agents can be measured throughout 2026.
Importance: 2/5
New benchmark, prominent on HF Daily; a useful baseline for research agents