A Systematic Analysis of Hybrid Linear Attention: 72-Model Study
ByteDance Seed
Researchers trained and open-sourced 72 models (340M–1.3B parameters) covering six linear attention variants at varying hybridization ratios. Key finding: the best standalone linear attention model does not make the best hybrid. Recall improves sharply once full-attention layers make up more than roughly 1 in 4 layers. HGRN-2 and GatedDeltaNet at linear-to-full ratios of 3:1–6:1 reach Transformer-level recall with substantially lower compute on long sequences.
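To make the ratios concrete, here is a minimal sketch of how a hybridization ratio could map onto a layer stack. It assumes the ratios are read as linear:full and that one full-attention layer follows each run of linear layers; the summary does not say how the study actually places its layers. The function names and the 24-layer depth are hypothetical, chosen only for illustration.

```python
def hybrid_layout(n_layers: int, linear_per_full: int) -> list[str]:
    """Build a layer-type list for a linear:full ratio of `linear_per_full`:1.

    Assumption: full-attention layers are evenly interleaved, one after
    every `linear_per_full` linear layers. Real models may place them differently.
    """
    period = linear_per_full + 1
    return ["full" if (i + 1) % period == 0 else "linear" for i in range(n_layers)]


def full_fraction(layout: list[str]) -> float:
    """Share of layers in the stack that use full attention."""
    return layout.count("full") / len(layout)


if __name__ == "__main__":
    # Hypothetical 24-layer stack at the two ends of the 3:1 to 6:1 range.
    for ratio in (3, 6):
        layout = hybrid_layout(24, ratio)
        print(f"{ratio}:1 ->", layout.count("full"), "full layers,",
              f"fraction {full_fraction(layout):.3f}")
```

Under these assumptions, a 3:1 ratio puts full attention in exactly one of every four layers, which is the threshold the summary describes, while 6:1 uses roughly half as many full-attention layers.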
Why it matters
One of the most rigorous empirical studies of hybrid attention to date, with open-sourced checkpoints; its practical guidance on architecture choice and mixing ratio is directly actionable for practitioners building long-context LLMs.
Importance: 3/5
A 72-model empirical study providing actionable guidance on hybrid attention architecture design; companion to FlashMorph from the same lab.