Soohak: 64 Mathematicians Build Research-Level Benchmark That Stumps Frontier LLMs
Seoul National University
Soohak is a 439-problem benchmark authored from scratch by 64 professional mathematicians to evaluate whether frontier LLMs can reason at the level required to advance mathematical knowledge. Top models score only 10.4–30.4% on the challenge problems: Claude Opus 4.5 at 10.4%, GPT-5 at 26.4%, and Gemini 3 Pro at 30.4%. A novel refusal subset tests whether models can detect ill-posed problems and abstain; no model exceeds 50% on this dimension.
Why it matters
Provides the most rigorous evaluation to date of frontier-model mathematical reasoning, showing that even top models fail dramatically on genuine research-level problems and cannot reliably detect ill-posed questions.
Importance: 2/5
68 HF Daily upvotes; a benchmark authored by 64 professional mathematicians reveals a large gap between LLMs' olympiad performance and genuine research-level mathematical capability.