SOOHAK: Frontier LLMs Solve Hard Math But Fail to Recognize Unsolvable Problems

Research | official + media, 2 sources | ~1 min read

A 64-mathematician consortium from CMU, EleutherAI, and Seoul National University published SOOHAK, a 439-problem research-level math benchmark. Reported frontier scores: Gemini 3 Pro 30.4%, GPT-5 26.4%, Claude Opus 4.5 10.4%. On a 'refusal subset' of 99 intentionally ill-posed problems, no model exceeded 50% accuracy at declining to answer. Models regularly produced confident wrong answers to problems that have no valid solution.
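To make the two kinds of score concrete, here is a minimal sketch of how solve accuracy and refusal accuracy could be computed separately. It is an illustration only, not the benchmark's actual scoring code: the `Result` record, its field names, and the toy data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Result:
    # Hypothetical per-problem record; these fields are assumptions, not SOOHAK's schema.
    ill_posed: bool  # True if the problem was deliberately made unsolvable
    refused: bool    # True if the model declined to give an answer
    correct: bool    # True if the answer matched the reference (solvable problems only)


def score(results):
    """Return (solve_accuracy, refusal_accuracy); None for an empty subset."""
    solvable = [r for r in results if not r.ill_posed]
    unsolvable = [r for r in results if r.ill_posed]

    # A solvable problem counts only if the model answered and was right.
    solve_acc = (
        sum(r.correct and not r.refused for r in solvable) / len(solvable)
        if solvable else None
    )
    # An ill-posed problem counts only if the model refused to answer it.
    refusal_acc = (
        sum(r.refused for r in unsolvable) / len(unsolvable)
        if unsolvable else None
    )
    return solve_acc, refusal_acc


# Toy data: 4 solvable problems (2 answered correctly), 4 ill-posed (1 refused).
toy = [
    Result(ill_posed=False, refused=False, correct=True),
    Result(ill_posed=False, refused=False, correct=True),
    Result(ill_posed=False, refused=False, correct=False),
    Result(ill_posed=False, refused=True, correct=False),
    Result(ill_posed=True, refused=True, correct=False),
    Result(ill_posed=True, refused=False, correct=False),
    Result(ill_posed=True, refused=False, correct=False),
    Result(ill_posed=True, refused=False, correct=False),
]

solve_acc, refusal_acc = score(toy)
print(solve_acc, refusal_acc)
```

Keeping the two numbers separate is the point of the sketch: a model can improve on the first while staying flat on the second, which is the pattern the summary describes.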

Why it matters

Scaling compute makes models better at solving hard math but does not help them recognize when a problem has no answer. This 'confident wrongness' failure mode is a concern for deploying frontier LLMs in high-stakes scientific contexts.

Importance: 3/5

64-mathematician benchmark exposing confident wrongness in frontier models: no model exceeded 50% at recognizing unsolvable math problems

benchmark, mathematics, reasoning, gpt-5, evaluation

Sources

official Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs (arXiv)

media New math benchmark reveals AI models confidently solve problems that have no solution — The Decoder