Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and a Fix

Research · official · 1 source · ~1 min read

Accepted to ICML 2026, this paper (arXiv:2605.06611) traces attention sinks, in which initial tokens disproportionately capture attention, to a variance discrepancy in value aggregation. The effect is intensified when FFN layers activate 'super neurons,' causing dimension misalignment in first-token representations. Two controlled experiments validate the causal chain, and the authors propose head-wise RMSNorm as an architectural fix that restores statistical balance, stabilizes outputs, and accelerates training convergence.
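For intuition, a minimal sketch of what "head-wise RMSNorm" could look like: RMSNorm applied independently to each attention head's output before the output projection, so every head's contribution has comparable scale. The module name, parameter shapes, and placement below are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class HeadWiseRMSNorm(nn.Module):
    """RMSNorm applied independently per attention head (illustrative sketch).

    Assumption: inputs are per-head attention outputs of shape
    (batch, num_heads, seq_len, head_dim); the paper's exact placement
    and parameterization may differ.
    """

    def __init__(self, num_heads: int, head_dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        # One learnable scale vector per head.
        self.weight = nn.Parameter(torch.ones(num_heads, head_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize each head's vectors by their own root-mean-square.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        # Broadcast per-head scales: (num_heads, 1, head_dim).
        return x * rms * self.weight.unsqueeze(1)


# Hypothetical usage: normalize per-head outputs before the output projection.
norm = HeadWiseRMSNorm(num_heads=8, head_dim=64)
attn_out = torch.randn(2, 8, 128, 64)   # (batch, heads, seq, head_dim)
print(norm(attn_out).shape)             # torch.Size([2, 8, 128, 64])
```

Normalizing per head (rather than over the full model dimension) is what would directly address a variance discrepancy across heads, which is presumably why the fix is described as head-wise.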

Why it matters

A mechanistic causal account of a widely observed but poorly understood phenomenon, with a concrete architectural remedy of practical use to builders of long-context and efficient-inference systems. ICML 2026 acceptance adds peer-review credibility.

Importance: 2/5

ICML 2026 accepted; first mechanistic causal account of attention sink with a validated architectural fix (head-wise RMSNorm).

Sources