vLLM v0.21.0: Blackwell MLA Backend, HMA KV Offload, Spec Decode for Reasoning Models
vLLM Project
vLLM v0.21.0 shipped May 15, 2026 (367 commits, 202 contributors). Key additions: a TOKENSPEED_MLA attention backend for DeepSeek-R1 and Kimi-K2.5 on NVIDIA Blackwell GPUs; KV offloading integrated with the Hybrid Memory Allocator (HMA); speculative decoding that now respects reasoning/thinking budgets, so it behaves correctly with reasoning models; and a Docker image reduced by ~2.5 GB. Breaking changes: a C++20 compiler is required, and Transformers v4 is deprecated (must upgrade to v5).
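Because the Transformers v4 deprecation is listed as a breaking change, it may be worth checking the installed major version before upgrading. The following is a minimal sketch, not part of vLLM: the helper names `major_version` and `transformers_is_v5_or_newer` are hypothetical, and the only assumption about the environment is that the library is installed under the distribution name `transformers`.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string: str) -> int:
    """Return the leading integer of a version string such as '5.0.0rc1'."""
    match = re.match(r"\d+", version_string)
    if match is None:
        raise ValueError(f"cannot parse version: {version_string!r}")
    return int(match.group())


def transformers_is_v5_or_newer(installed=None) -> bool:
    """Hypothetical pre-upgrade check: is Transformers at major version 5+?

    Pass a version string to test the logic directly; with no argument the
    installed 'transformers' distribution is looked up, and a missing
    package counts as not satisfying the requirement.
    """
    if installed is None:
        try:
            installed = version("transformers")
        except PackageNotFoundError:
            return False
    return major_version(installed) >= 5


if __name__ == "__main__":
    print("Transformers v5 or newer:", transformers_is_v5_or_newer())
```

Running the script prints whether the current environment already meets the v5 requirement; it does not check the C++20 compiler requirement, which applies to builds from source.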
Why it matters
TOKENSPEED_MLA on Blackwell enables production-grade serving of DeepSeek-R1-class models with better GPU utilization. Correct speculative decoding for reasoning models is a long-awaited fix for anyone deploying thinking-budget-constrained models at scale.
Importance: 3/5
Major inference infrastructure release: Blackwell MLA backend, spec decode for reasoning models, 367 commits from 202 contributors