A Systematic Analysis of Hybrid Linear Attention: 72-Model Study

ByteDance Seed


Researchers trained and open-sourced 72 models (340M–1.3B parameters) covering six linear attention variants at varying linear-to-full attention ratios. Key finding: the best standalone linear attention model does not make the best hybrid. Recall improves sharply once the share of full-attention layers rises above roughly 1-in-4 (a linear-to-full ratio of about 3:1). HGRN-2 and GatedDeltaNet at linear-to-full ratios of 3:1 to 6:1 reach transformer-level recall with substantially lower compute on long sequences.
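
To make the ratio notation concrete, here is a minimal Python sketch of how an N:1 linear-to-full ratio maps to a layer stack. It assumes one full-attention layer is placed after every N linear-attention layers; the summary does not say where the study places its full-attention layers, and the helper names and the 24-layer depth are hypothetical, chosen only for illustration.

```python
def hybrid_layout(num_layers: int, linear_per_full: int) -> list[str]:
    """Label each layer 'linear' or 'full' for an N:1 linear-to-full ratio.

    Assumption (not from the study): the full-attention layer closes each
    group of `linear_per_full` linear-attention layers.
    """
    if num_layers <= 0 or linear_per_full < 0:
        raise ValueError("num_layers must be positive, linear_per_full non-negative")
    period = linear_per_full + 1  # N linear layers plus 1 full-attention layer
    return [
        "full" if (i + 1) % period == 0 else "linear"
        for i in range(num_layers)
    ]


def full_fraction(layout: list[str]) -> float:
    """Fraction of layers in the stack that use full attention."""
    return layout.count("full") / len(layout)


if __name__ == "__main__":
    # Hypothetical 24-layer stack at the two ends of the 3:1 to 6:1 range.
    for ratio in (3, 6):
        layout = hybrid_layout(24, ratio)
        print(f"{ratio}:1 -> {layout.count('full')} full-attention layers "
              f"out of {len(layout)} ({full_fraction(layout):.3f})")
```

Under this placement assumption, a 3:1 ratio makes one layer in four full attention, and a 6:1 ratio makes one layer in seven full attention.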

Why it matters

This is one of the most rigorous empirical studies of hybrid attention to date, and its checkpoints are open-sourced. The guidance on architecture choice and mixing ratio is directly actionable for practitioners building long-context LLMs.

Importance: 3/5

A 72-model empirical study providing actionable guidance on hybrid attention architecture design; a companion to FlashMorph from the same lab.

attention architecture long-context efficiency language-models
