MiniMax Sparse Attention: 28× Compute Reduction at 1M-Token Context with No Quality Loss
MiniMax
MiniMax published a paper introducing a blockwise sparse attention mechanism built on Grouped Query Attention that achieves a 28.4× reduction in per-token attention compute at 1M-token context while matching the quality of full attention. The technique uses an Index Branch to score and select relevant KV blocks, with a Main Branch performing exact attention over the selected blocks. It underpins MiniMax M3, the first open-weight model combining frontier coding capability, 1M-token context, and native multimodality in a single architecture. The paper received 251 upvotes on HuggingFace Daily Papers.
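The two-branch idea in the summary can be illustrated with a minimal sketch. This is not the paper's implementation: the scoring rule (dot product against each block's mean key), the choice to share one block selection across the query heads of a GQA group, and all function and parameter names (`select_blocks`, `sparse_gqa_attention`, `block_size`, `top_k`) are assumptions made for illustration only.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Exact scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

def select_blocks(group_queries, keys, block_size, top_k):
    """Index branch (assumed form): score every KV block, keep the top_k."""
    n_blocks = math.ceil(len(keys) / block_size)
    scores = []
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        # Assumption: a block is summarised by the mean of its keys.
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        # Query heads in one GQA group share a KV head, so this sketch
        # sums their scores and lets them share a single selection.
        scores.append(sum(dot(q, mean_key) for q in group_queries))
    ranked = sorted(range(n_blocks), key=lambda b: scores[b], reverse=True)
    return sorted(ranked[:top_k])

def sparse_gqa_attention(group_queries, keys, values, block_size, top_k):
    """Main branch: exact attention restricted to the selected blocks."""
    chosen = select_blocks(group_queries, keys, block_size, top_k)
    idx = [i for b in chosen
           for i in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    sel_k = [keys[i] for i in idx]
    sel_v = [values[i] for i in idx]
    return [attend(q, sel_k, sel_v) for q in group_queries], chosen
```

In this sketch the Main Branch touches only `top_k * block_size` keys per query instead of the whole context, which is where the compute saving comes from; the Index Branch still visits every block, but only once per block rather than once per token. Setting `top_k` to the total number of blocks recovers full attention exactly.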
Why it matters
Quadratic attention cost has been the primary barrier to practical 1M-token context windows. This work reports a 28.4× cut in per-token attention compute at that length while matching full-attention quality, and it backs the claim with a production model, MiniMax M3, rather than a paper result alone. The 251 upvotes on HF Daily Papers reflect strong community interest.
Importance: 4/5
Significant efficiency breakthrough for long-context inference with production evidence; 251 HF upvotes (+1 bump applied); backs the MiniMax M3 open-weight release.