MiniMax Sparse Attention: 28× Compute Reduction at 1M-Token Context with No Quality Loss

MiniMax

Research | official | 3 sources | ~1 min read

MiniMax published a paper introducing a blockwise sparse attention mechanism built on Grouped Query Attention (GQA). It cuts per-token attention compute by 28.4× at 1M-token context while matching the quality of full attention. An Index Branch scores the KV blocks and selects the relevant ones; a Main Branch then performs exact attention over only the selected blocks. The mechanism underpins MiniMax M3, the first open-weight model combining frontier coding capability, 1M-token context, and native multimodality in a single architecture. The paper received 251 upvotes on Hugging Face Daily Papers.
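
The Index Branch / Main Branch split can be sketched in a few lines of plain Python. This is a minimal single-head illustration, not MiniMax's implementation: the block scorer (the query dotted with the mean key of each block), the function names, and the `block_size` and `top_k` parameters are all assumptions made for the example, and GQA head grouping is left out.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Exact scaled dot-product attention for one query over the given keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def select_blocks(query, keys, block_size, top_k):
    """Index step (ASSUMED scorer): rank blocks by query . mean(keys in block)."""
    scored = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scored.append((dot(query, mean_key), start))
    best = sorted(scored, reverse=True)[:top_k]
    # Return block start offsets in sequence order.
    return sorted(start for _, start in best)


def block_sparse_attend(query, keys, values, block_size, top_k):
    """Main step: exact attention restricted to the selected KV blocks."""
    starts = select_blocks(query, keys, block_size, top_k)
    idx = [i for s in starts for i in range(s, min(s + block_size, len(keys)))]
    return attend(query, [keys[i] for i in idx], [values[i] for i in idx])
```

With `top_k` equal to the number of blocks, the sketch reduces to full attention; smaller `top_k` values shrink the set of keys the exact-attention step has to touch.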

Why it matters

Quadratic attention cost has been the primary barrier to practical 1M-token context windows. This work reports a roughly 28× cut in attention compute with quality matching full attention, and it ships in a production model rather than stopping at a paper result. The 251 upvotes on Hugging Face Daily Papers reflect strong community interest.
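
To put the numbers in perspective, the arithmetic below is a rough sketch. It assumes dense attention cost is proportional to the number of query-key pairs scored, ignores constant factors, and treats the reported 28.4× as a net per-token figure; the function names are hypothetical. It shows why total dense cost grows quadratically with sequence length, and what dense context length a 28.4× per-token reduction would correspond to at 1M tokens.

```python
def dense_per_token_cost(context_len):
    """Keys scored by one new query under full attention (linear in context)."""
    return context_len


def dense_prefill_cost(context_len):
    """Query-key pairs scored over a whole causal sequence: n(n+1)/2."""
    return context_len * (context_len + 1) // 2


def dense_equivalent_context(context_len, reduction):
    """Context length whose dense per-token cost equals the reduced cost."""
    return dense_per_token_cost(context_len) / reduction


n = 1_000_000
print(dense_prefill_cost(n))                      # 500000500000 pairs
print(round(dense_equivalent_context(n, 28.4)))   # 35211 tokens
```

Under these assumptions, attending at 1M tokens with a 28.4× reduction costs about as much per token as dense attention over a context of roughly 35K tokens.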

Importance: 4/5

Significant efficiency breakthrough for long-context inference with production evidence; 251 HF upvotes (+1 bump applied); backs the MiniMax M3 open-weight release.

Sources