vLLM v0.20.2: TurboQuant 2-bit KV Cache and FlashAttention 4 Default for MoE Serving
vLLM v0.20.2 patches the major v0.20.0 release. Headline v0.20.0 features include DeepSeek V4 support, FlashAttention 4 as the default MLA prefill backend, a TurboQuant 2-bit KV cache (4× memory capacity over standard FP16), and a CUDA 13 / PyTorch 2.11 / Transformers v5 baseline. The v0.20.2 patch stabilizes DeepSeek V4 with multi-stream GEMM, configurable GEMM tuning knobs, and BF16/MXFP8 all-to-all, plus fixes for TopK cooperative-kernel deadlocks and NVFP4 MoE kernels on RTX Blackwell workstation GPUs.
Why it matters
TurboQuant's 2-bit KV cache quadruples effective cache capacity, a major efficiency gain for long-context serving; making FA4 the default MLA prefill backend improves MoE prefill performance at production scale.
Importance: 3/5
The TurboQuant 2-bit KV cache (4× capacity) and FA4 as the default MLA prefill backend deliver major efficiency gains for production-scale DeepSeek-class MoE serving.