Mean Mode Screaming: Training Pathology Fix Enables 1000-Layer Diffusion Transformers
This paper identifies Mean Mode Screaming (MMS), a training collapse in which Diffusion Transformers (DiTs) at extreme depths suppress token variation while the loss still appears stable. The proposed Mean-Variance Split (MV-Split) Residuals combine a centered residual update that has its own gain with a leaky replacement of the trunk mean, eliminating collapse events and enabling stable training of 1000-layer DiTs.
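To make the one-sentence description concrete, here is a rough sketch of what a mean/variance split residual could look like. It is an interpretation of the summary above, not the paper's formulation: the function names, the `alpha` and `beta` parameters, the choice to take means over the token axis, and the exact update rule are all assumptions made for illustration.

```python
# Illustrative sketch only. Names, parameters, and the update rule are
# assumptions based on the summary, not taken from the paper.
# x and f are lists of tokens; each token is a list of floats.

def token_mean(x):
    """Per-feature mean over the token axis."""
    n = len(x)
    return [sum(tok[d] for tok in x) / n for d in range(len(x[0]))]

def center(x):
    """Subtract the token mean from every token."""
    m = token_mean(x)
    return [[v - mu for v, mu in zip(tok, m)] for tok in x]

def token_variance(x):
    """Average squared deviation from the token mean.
    A value drifting toward zero would indicate suppressed token variation."""
    c = center(x)
    return sum(v * v for tok in c for v in tok) / (len(x) * len(x[0]))

def mv_split_residual(x, f, alpha=0.1, beta=0.05):
    """Combine trunk x with branch output f in two separate channels.

    Centered channel: x_centered + alpha * f_centered  (alpha is its own gain)
    Mean channel:     (1 - beta) * mean(x) + beta * mean(f)  (leaky replacement)
    """
    mean_x, mean_f = token_mean(x), token_mean(f)
    x_c, f_c = center(x), center(f)
    new_mean = [(1 - beta) * a + beta * b for a, b in zip(mean_x, mean_f)]
    return [
        [xv + alpha * fv + m for xv, fv, m in zip(x_tok, f_tok, new_mean)]
        for x_tok, f_tok in zip(x_c, f_c)
    ]
```

Under these assumptions, a branch output that is identical for every token can only shift the trunk mean by a factor of `beta`; it leaves the centered component of the trunk, and therefore `token_variance`, unchanged.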
Why it matters
119 HF Daily upvotes. Directly relevant to scaling generative models: prior depth-scaling efforts for DiT-based pipelines carried this hidden failure mode, which has only now been diagnosed and resolved architecturally.
Importance: 3/5
119 HF Daily upvotes; identifies and fixes a previously hidden training collapse at extreme DiT depths, enabling 1000-layer diffusion transformer architectures.