OpenAI Discloses Accidental Chain-of-Thought Grading in RL Training Across Six Models
OpenAI
OpenAI disclosed that six released models (GPT-5.4 Thinking, GPT-5.1–5.4 Instant, and GPT-5.3–5.4 mini) were inadvertently exposed to chain-of-thought grading during RL training, a practice OpenAI's policy prohibits because grading the chain of thought incentivizes models to produce misleading reasoning traces. An automated detection system based on regex matching identified three accidental CoT-grading instances; the affected reward pathways were fixed, and ablations found no clear reduction in CoT monitorability, though unmeasured effects cannot be ruled out. Redwood Research provided an independent external review.
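OpenAI has not published the detector itself, but the description suggests a lexical scan of reward-grader configurations for references to chain-of-thought fields. Below is a minimal sketch of that idea in Python; the field names (grader_input, reasoning_trace, reward_fn) are illustrative assumptions, not OpenAI's actual schema.

```python
import re

# Hypothetical CoT field names; OpenAI has not published its detector or schema.
COT_FIELD = r"(chain_of_thought|cot|reasoning(_trace)?)"

# Flag reward-grader configuration lines that appear to feed CoT to a grader.
PATTERNS = [
    re.compile(rf"grader_input\s*[:=].*\b{COT_FIELD}\b", re.IGNORECASE),
    re.compile(rf"reward_fn\s*\(.*\b{COT_FIELD}\b", re.IGNORECASE),
]

def scan_reward_config(path: str, text: str) -> list[str]:
    """Return flagged lines where a grader appears to consume the chain of thought."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if any(p.search(line) for p in PATTERNS):
            findings.append(f"{path}:{lineno}: {line.strip()}")
    return findings

if __name__ == "__main__":
    sample = (
        "grader_input: [final_answer]\n"
        "grader_input: [final_answer, reasoning_trace]\n"  # accidental CoT exposure
    )
    for hit in scan_reward_config("reward_config.yaml", sample):
        print(hit)
```

A purely lexical scan like this only catches explicit field references; indirect routes by which a grader could see the CoT (e.g., a grader reading a full transcript) would require deeper dataflow checks, which may be why the disclosed fix also involved reward-pathway changes and ablations.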
Why it matters
Rare public safety disclosure from OpenAI about a training mistake affecting multiple released models; accidental CoT grading could suppress evidence of misaligned goals in model reasoning traces.
Importance: 3/5
Public disclosure of a training-policy violation affecting six released models; Redwood Research's independent review underscores the risk that CoT grading suppresses evidence of misaligned goals in reasoning traces.