OpenAI Discloses Accidental Chain-of-Thought Grading in RL Training Across Six Models
OpenAI
OpenAI disclosed that six released models (GPT-5.4 Thinking, GPT-5.1–5.4 Instant, and GPT-5.3–5.4 mini) were inadvertently exposed to chain-of-thought grading during RL training, a practice OpenAI's policy prohibits because grading the chain of thought incentivizes models to produce misleading reasoning traces. An automated detection system based on regex matching identified three accidental CoT-grading instances; the affected reward pathways were fixed, and ablations found no clear reduction in CoT monitorability, though unmeasured effects cannot be ruled out. Redwood Research provided an independent external review.
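OpenAI has not published the detector itself, but the description suggests a lexical scan of reward-grader configurations for references to chain-of-thought fields. Below is a minimal sketch of that idea in Python; the field names (grader_input, reasoning_trace, reward_fn) are illustrative assumptions, not OpenAI's actual schema.

```python
import re

# Hypothetical CoT field names; OpenAI has not published its detector or schema.
COT_FIELD = r"(chain_of_thought|cot|reasoning(_trace)?)"

# Flag reward-grader configuration lines that appear to feed CoT to a grader.
PATTERNS = [
    re.compile(rf"grader_input\s*[:=].*\b{COT_FIELD}\b", re.IGNORECASE),
    re.compile(rf"reward_fn\s*\(.*\b{COT_FIELD}\b", re.IGNORECASE),
]

def scan_reward_config(path: str, text: str) -> list[str]:
    """Return flagged lines where a grader appears to consume the chain of thought."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if any(p.search(line) for p in PATTERNS):
            findings.append(f"{path}:{lineno}: {line.strip()}")
    return findings

if __name__ == "__main__":
    sample = (
        "grader_input: [final_answer]\n"
        "grader_input: [final_answer, reasoning_trace]\n"  # accidental CoT exposure
    )
    for hit in scan_reward_config("reward_config.yaml", sample):
        print(hit)
```

A purely lexical scan like this only catches explicit field references; indirect routes by which a grader could see the CoT (e.g., a grader reading a full transcript) would require deeper dataflow checks, which may be why the disclosed fix also involved reward-pathway changes and ablations.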
Why it matters
Rare public safety disclosure from OpenAI about a training mistake affecting multiple released models; accidental CoT grading could suppress evidence of misaligned goals in model reasoning traces.
Importance: 3/5
Public disclosure of a training-policy violation affecting six released models; Redwood Research's independent review underscores the risk that CoT grading suppresses evidence of misaligned goals in reasoning traces.