SOOHAK: Frontier LLMs Solve Hard Math But Fail to Recognize Unsolvable Problems
A 64-mathematician consortium from CMU, EleutherAI, and Seoul National University published SOOHAK, a 439-problem research-level math benchmark. Frontier scores: Gemini 3 Pro 30.4%, GPT-5 26.4%, Claude Opus 4.5 10.4%. On a 'refusal subset' of 99 intentionally ill-posed problems, no model exceeded 50% accuracy at refusing the unsolvable questions; models regularly produced confident wrong answers to problems with no valid solution.
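For readers who want the refusal metric made concrete, here is a minimal sketch of how such a score could be computed. It assumes refusal accuracy means the fraction of ill-posed problems on which the model declined to answer; the `Result` record, its field names, and the toy data are hypothetical and are not taken from the SOOHAK release.

```python
from dataclasses import dataclass


@dataclass
class Result:
    # Hypothetical fields, assumed for illustration only.
    ill_posed: bool  # True if the problem has no valid solution
    refused: bool    # True if the model declined to give an answer


def refusal_accuracy(results):
    """Fraction of ill-posed problems on which the model refused."""
    ill_posed = [r for r in results if r.ill_posed]
    if not ill_posed:
        raise ValueError("no ill-posed problems in results")
    return sum(r.refused for r in ill_posed) / len(ill_posed)


# Toy data, not benchmark results: 10 ill-posed problems (4 refused)
# plus 5 well-posed problems, which the metric ignores.
toy = [Result(True, i < 4) for i in range(10)]
toy += [Result(False, False) for _ in range(5)]

rate = refusal_accuracy(toy)
print(f"refusal accuracy: {rate:.1%}")
```

Under this definition, a model that confidently answers every ill-posed problem scores 0%, regardless of how well it does on the solvable problems.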
Why it matters
Scaling compute makes models better at solving hard math but does not help them recognize when a problem has no answer. This 'confident wrongness' failure mode is a direct concern for deploying frontier LLMs in high-stakes scientific contexts, where a plausible but wrong answer to an ill-posed question may go unnoticed.
Importance: 3/5
64-mathematician benchmark exposes confident wrongness in frontier models: no model exceeded 50% at recognizing unsolvable math problems