OpenAI Releases GeneBench-Pro, a Frontier Benchmark for AI Agents in Biology

OpenAI

Research | official + media, 2 sources | ~1 min read

On June 30, OpenAI released GeneBench-Pro, a 129-problem benchmark that tests AI judgment across genomics, cancer biology, clinical diagnostics, and pharmacogenomics. Each problem requires a sequence of judgment calls and would take a human expert 20–40 hours to resolve. GPT-5.6 Sol scores 28.7% (31.5% in Pro mode); Claude Opus 4.8 scores 16.0%. Ten representative questions are open-sourced on Hugging Face.

Why it matters

Unlike knowledge-recall benchmarks, GeneBench-Pro measures 'research taste' under uncertainty. GPT-5.6 Sol's 28.7% score means it fails about 71% of these expert-level tasks (about 69% in Pro mode), which shows how far current frontier models remain from autonomous scientific reasoning.
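
The failure-rate figure follows directly from the reported scores. A minimal sketch of that arithmetic is below; it assumes each score is a plain percentage of the 129 problems, which the summary does not confirm (scores may be averaged over runs or use partial credit), so the implied problem counts are rough estimates only. The names `SCORES`, `failure_rate`, and `implied_solved` are illustrative and not part of the benchmark.

```python
# Reported scores from the summary, in percent.
SCORES = {
    "GPT-5.6 Sol": 28.7,
    "GPT-5.6 Sol (Pro mode)": 31.5,
    "Claude Opus 4.8": 16.0,
}

# Benchmark size stated in the summary.
TOTAL_PROBLEMS = 129


def failure_rate(score_pct: float) -> float:
    """Share of the benchmark not solved, in percent, rounded to one decimal."""
    return round(100.0 - score_pct, 1)


def implied_solved(score_pct: float, total: int = TOTAL_PROBLEMS) -> float:
    """Approximate number of problems solved.

    Assumption: the score is a simple fraction of problems solved.
    The actual scoring method is not described in the summary.
    """
    return round(score_pct / 100.0 * total, 1)


def summarize() -> list[tuple[str, float, float]]:
    """Return (model, failure rate in percent, implied problems solved) rows."""
    return [
        (model, failure_rate(score), implied_solved(score))
        for model, score in SCORES.items()
    ]


if __name__ == "__main__":
    for model, fail_pct, solved in summarize():
        print(f"{model}: fails {fail_pct}% (about {solved} of {TOTAL_PROBLEMS} solved)")
```

Under that assumption, GPT-5.6 Sol fails 71.3% of the benchmark in standard mode and 68.5% in Pro mode, so the "more than 70%" figure applies to the standard-mode score.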

Importance: 3/5

OpenAI releases first research-grade biology reasoning benchmark; GPT-5.6 Sol at 28.7% illustrates gap to autonomous science; ten representative questions open-sourced on Hugging Face

openai benchmark science life-sciences research

Sources

official Introducing GeneBench-Pro — OpenAI

media OpenAI GeneBench-Pro tests AI judgment in biology research