vLLM v0.21.0: Blackwell MLA Backend, HMA KV Offload, Spec Decode for Reasoning Models

vLLM Project


vLLM v0.21.0 shipped on May 15, 2026, with 367 commits from 202 contributors. Key additions: a TOKENSPEED_MLA attention backend for DeepSeek-R1 and Kimi-K2.5 on NVIDIA Blackwell GPUs; KV offloading integrated with the Hybrid Memory Allocator (HMA); speculative decoding that now respects reasoning/thinking budgets, for correctness with reasoning models; and a Docker image that is about 2.5 GB smaller. Breaking changes: a C++20 compiler is now required, and Transformers v4 is deprecated (upgrade to v5).
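Because the Transformers v4 deprecation is a breaking change, it can help to check the installed major version before upgrading. The following is a minimal sketch, not part of vLLM: the helper names `major_version` and `meets_minimum_major` are hypothetical, and it assumes the package is published under the distribution name `transformers` and uses a plain `MAJOR.MINOR.PATCH` version string.

```python
from importlib import metadata
from typing import Optional


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '5.0.1'."""
    return int(version.split(".", 1)[0])


def meets_minimum_major(package: str, required_major: int) -> Optional[bool]:
    """Compare an installed package's major version against a minimum.

    Returns True or False when the package is installed, and None when it
    is not installed at all.
    """
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return major_version(installed) >= required_major


if __name__ == "__main__":
    # Assumption: v5 is the minimum major version, per the release summary.
    status = meets_minimum_major("transformers", 5)
    if status is None:
        print("transformers is not installed")
    elif status:
        print("transformers major version is 5 or newer")
    else:
        print("transformers is older than v5; upgrade before moving to vLLM v0.21.0")
```

The check only reads package metadata, so it runs without importing Transformers or touching a GPU. The C++20 requirement is not covered here; how it applies to a given install path is best confirmed against the release notes.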

Why it matters

TOKENSPEED_MLA on Blackwell enables production-grade serving of DeepSeek-R1-class models with better GPU utilization. Correct speculative decoding under a thinking budget is a long-awaited fix for anyone deploying budget-constrained reasoning models at scale.

Importance: 3/5

Major inference infrastructure release: Blackwell MLA backend, speculative decoding for reasoning models, and 367 commits from 202 contributors.

vllm inference open-source gpu deepseek speculative-decoding

Sources

official Release v0.21.0 — vllm-project/vllm — GitHub

official vllm — PyPI