AutoResearchBench — a benchmark for autonomous scientific literature search by AI agents

BAAI

Research, official + media, 2 sources, ~1 min read

A new benchmark has been published for evaluating agents on autonomous scientific literature search and review. It includes two complementary setups: Deep Research (multi-step investigation leading to a specific target paper) and Wide Research (exhaustive collection of publications matching given criteria, scored by IoU). Even the strongest LLM agents reach only 9.39% accuracy on Deep Research and 9.31% IoU on Wide Research.
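For readers unfamiliar with the Wide Research metric, IoU (intersection over union) can be sketched as set overlap between the papers an agent returns and the reference set. This is a minimal illustration and not the benchmark's own scoring code: the function name `iou`, the use of plain string identifiers, and the handling of the empty case are all assumptions made here.

```python
def iou(predicted, relevant):
    """Intersection over union of two collections of paper identifiers.

    Assumption: papers are matched by exact identifier equality;
    the benchmark's real matching procedure may differ.
    """
    pred, gold = set(predicted), set(relevant)
    union = pred | gold
    if not union:
        # Assumption: two empty sets are treated as a perfect match.
        return 1.0
    return len(pred & gold) / len(union)


# Hypothetical identifiers, for illustration only.
agent_output = ["paper_a", "paper_b", "paper_c"]
reference_set = ["paper_b", "paper_c", "paper_d", "paper_e"]

score = iou(agent_output, reference_set)
print(f"IoU = {score:.2f}")  # 2 shared papers / 5 distinct papers = 0.40
```

The metric penalizes both missed papers and irrelevant ones, so an agent cannot raise its score simply by returning more results.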

Why it matters

Closes a methodological gap between general-purpose web agents and the actual work of a researcher; the ~9% figures set a baseline against which progress on research agents can be measured throughout 2026.

Importance: 2/5

New benchmark, prominent on HF Daily; a useful baseline for research agents

agents benchmark rag evaluation

Sources

official AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

media HF Daily Papers — AutoResearchBench