OpenAI Releases GeneBench-Pro, a Frontier Benchmark for AI Agents in Biology
OpenAI
OpenAI released GeneBench-Pro (June 30), a 129-problem benchmark testing AI judgment across genomics, cancer biology, clinical diagnostics, and pharmacogenomics. Each problem requires a sequence of judgment calls and would take a human expert 20–40 hours to resolve. GPT-5.6 Sol scores 28.7% (31.5% in Pro mode); Claude Opus 4.8 scores 16.0%. Ten representative questions are open-sourced on Hugging Face.
Why it matters
Unlike knowledge-recall benchmarks, GeneBench-Pro measures 'research taste' under uncertainty. At 28.7%, GPT-5.6 Sol fails more than 70% of these expert-level tasks, which shows how far current frontier models remain from autonomous scientific reasoning.
Importance: 3/5
OpenAI releases first research-grade biology reasoning benchmark; GPT-5.6 Sol at 28.7% illustrates gap to autonomous science; ten representative questions open-sourced on Hugging Face