Soohak: 64 Mathematicians Build Research-Level Benchmark That Stumps Frontier LLMs
Seoul National University
Soohak is a 439-problem benchmark authored from scratch by 64 professional mathematicians to evaluate whether frontier LLMs can reason at the level required to advance mathematical knowledge. Top models score only 10.4–30.4% on the challenge problems: Claude Opus 4.5 at 10.4%, GPT-5 at 26.4%, and Gemini 3 Pro at 30.4%. A novel refusal subset tests whether models can detect ill-posed problems and abstain; no model exceeds 50% on this dimension.
Why it matters
Provides the most rigorous evaluation to date of frontier-model mathematical reasoning, showing that even top models fail dramatically on genuine research-level problems and cannot reliably detect ill-posed questions.
Importance: 2/5
68 HF Daily upvotes; a benchmark authored by 64 professional mathematicians reveals a large gap between LLMs' olympiad performance and genuine research-level mathematical capability.