Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and a Fix
Accepted to ICML 2026, this paper (arXiv:2605.06611) traces attention sinks, in which initial tokens disproportionately capture attention, to a variance discrepancy in value aggregation. The effect intensifies when FFN layers activate 'super neurons,' causing dimension misalignment in first-token representations. Two controlled experiments validate this causal chain, and the authors propose head-wise RMSNorm as an architectural fix that restores statistical balance, stabilizes outputs, and accelerates training convergence.
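To make the proposed fix concrete, here is a minimal sketch of head-wise RMSNorm in pure Python, assuming it is applied independently to each attention head's aggregated value vector. The function name, placement, and epsilon are illustrative assumptions, not the paper's exact implementation:

```python
import math

def headwise_rmsnorm(head_outputs, eps=1e-6):
    """Normalize each attention head's output vector by its own RMS.

    head_outputs: list of per-head vectors (lists of floats).
    Normalizing per head equalizes the scale of the value aggregates
    across heads, which is the statistical balance the paper argues
    removes the pull of attention toward the first token.
    (Illustrative sketch; the paper's exact placement may differ.)
    """
    normed = []
    for v in head_outputs:
        # Root-mean-square of this head's vector, with eps for stability.
        rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
        normed.append([x / rms for x in v])
    return normed
```

In a transformer block, such a normalization would sit between the per-head attention output and the output projection, so each head contributes at a comparable scale regardless of its raw variance.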
Why it matters
A mechanistic causal account of a widely observed but poorly understood phenomenon, with a concrete architectural remedy that is practically useful for builders of long-context and efficient-inference systems. ICML 2026 acceptance adds peer-review credibility.
Importance: 2/5
ICML 2026 accepted; first mechanistic causal account of attention sink with a validated architectural fix (head-wise RMSNorm).