The Self-Correction Illusion: LLMs Fix Others' Errors but Not Their Own — Role Labels Are the Cause

Research official 1 src. ~1 min

LLMs readily fix errors when presented as external input but fail to correct identical errors framed as their own prior output. The paper isolates the cause: chat-template role labels (user message vs. internal thought vs. tool output vs. system memory), not the content itself. Relabeling an internal erroneous claim as an external source increases explicit correction rates by 23-93 percentage points across 7 model families and 3 domains (p < 0.001 in 10/13 test cells). A prompt-structure intervention requiring no retraining achieves significant improvements.

Why it matters

Reframes LLM self-correction failure as an artifact of prompt structure rather than a fundamental cognitive limitation — both more actionable (fixable via prompting) and more revealing about how sensitive model behavior is to framing.

Importance: 3/5

Official arXiv publication; practical implications for agent system prompt design; isolates a root cause for a widely observed failure mode.

reasoning hallucination paper

Sources

official The Self-Correction Illusion — arXiv:2606.05976