vLLM v0.20.2: TurboQuant 2-bit KV Cache and FlashAttention 4 Default for MoE Serving
vLLM v0.20.2 patches the major v0.20.0 release. Headline v0.20.0 features include DeepSeek V4 support, FlashAttention 4 as the default MLA prefill backend, a TurboQuant 2-bit KV cache (4× memory capacity over standard FP16), and a CUDA 13 / PyTorch 2.11 / Transformers v5 baseline. The v0.20.2 patch stabilizes DeepSeek V4 with multi-stream GEMM, configurable GEMM tuning knobs, and BF16/MXFP8 all-to-all, plus fixes for TopK cooperative-kernel deadlocks and NVFP4 MoE kernels on RTX Blackwell workstation GPUs.
Why it matters
TurboQuant's 2-bit KV cache quadruples effective cache capacity, a major efficiency gain for long-context serving; making FA4 the default MLA prefill backend improves MoE prefill performance at production scale.
Importance: 3/5
The TurboQuant 2-bit KV cache (4× capacity) and FA4 as the default MLA prefill backend deliver major efficiency gains for production-scale DeepSeek-class MoE serving.