Dense Supervision Is Not Enough: The Readout Blind Spot in Looped Language Models

This paper diagnoses a training failure in looped (recurrent) transformer architectures. Scale-invariant readouts such as RMSNorm and LayerNorm create a 'blind spot': because the normalized readout discards the magnitude of the hidden state, per-loop cross-entropy supervision leaves that magnitude uncontrolled, and it grows into the thousands despite dense supervision. The authors propose two architectural fixes, either making scale visible to the loss function or removing it from the recurrent loop, and show that scale-controlled variants achieve better perplexity at matched inference depths on 44M- and 129M-parameter models.
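
The blind spot can be illustrated with a toy readout. The sketch below is not the paper's code: the function names, toy dimensions, random weights, and the choice of an RMSNorm without a learned gain or epsilon term are all assumptions made for illustration. It shows that multiplying a hidden state by 1000 leaves the cross-entropy loss essentially unchanged, so the loss alone gives no signal about the state's magnitude.

```python
import math
import random

def rms_norm(h):
    """RMSNorm without learned gain or epsilon (a simplification for this sketch)."""
    rms = math.sqrt(sum(x * x for x in h) / len(h))
    return [x / rms for x in h]

def readout_logits(h, W):
    """Normalize the hidden state, then project it to vocabulary logits."""
    z = rms_norm(h)
    return [sum(w * x for w, x in zip(row, z)) for row in W]

def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def l2_norm(h):
    return math.sqrt(sum(x * x for x in h))

rng = random.Random(0)
d_model, vocab = 8, 5  # hypothetical toy sizes, not taken from the paper
h = [rng.gauss(0, 1) for _ in range(d_model)]
W = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(vocab)]

# The same hidden state, blown up by a factor of 1000.
h_large = [1000.0 * x for x in h]

loss_small = cross_entropy(readout_logits(h, W), target=2)
loss_large = cross_entropy(readout_logits(h_large, W), target=2)
norm_small, norm_large = l2_norm(h), l2_norm(h_large)

print(f"norm {norm_small:.2f} -> loss {loss_small:.6f}")
print(f"norm {norm_large:.2f} -> loss {loss_large:.6f}")
```

Because the two losses agree up to floating-point rounding, the gradient of the loss with respect to the overall scale of the hidden state is zero in this toy setup. Practical normalization layers include a small epsilon, which breaks the invariance only slightly when the state is far from zero.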

Why it matters

Looped/recurrent transformers are a promising direction for compute-efficient inference (reusing weights across depth), but training instabilities have limited adoption. This work provides a concrete diagnosis and a simple design rule that could unblock practical development of this architecture class.

Importance: 2/5

A concrete diagnosis and architectural fix for training instability in looped transformers, which could unblock a promising compute-efficient architecture.

Sources