vLLM v0.22.0: DeepSeek V4 Production Hardening, Rust Frontend, 28.9% Latency Drop


vLLM v0.22.0 (released May 29, 2026) includes 459 commits from 230 contributors. Key highlights: DeepSeek V4 production hardening with NVFP4 fused MoE, full CUDA graph support, and MTP speculative decoding; a new experimental Rust frontend with a data-parallel serving supervisor; a 28.9% end-to-end latency improvement via CUTLASS FP8 batch-invariant inference; and multi-tier KV cache offloading to disk. AMD ROCm parity and NVIDIA Blackwell (SM12x) optimizations were also merged.

Why it matters

DeepSeek V4 is the most widely self-hosted frontier model; production-grade vLLM support plus a 28.9% latency improvement makes it significantly more viable for high-throughput deployments.
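
To put the 28.9% figure in concrete terms, here is a minimal arithmetic sketch. It assumes a hypothetical 1000 ms baseline (not a measured value) and an idealized single serial request stream where throughput is exactly the inverse of latency; real deployments with batching will behave differently. The function names are illustrative only and are not part of vLLM.

```python
def latency_after_drop(baseline_ms: float, drop_fraction: float) -> float:
    """Latency after a relative reduction (0.289 means 28.9% lower)."""
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    return baseline_ms * (1.0 - drop_fraction)


def serial_speedup(drop_fraction: float) -> float:
    """Requests-per-second multiplier for one serial stream.

    Assumes throughput == 1 / latency, which is an idealization.
    """
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    return 1.0 / (1.0 - drop_fraction)


DROP = 0.289          # figure quoted in the release summary
baseline_ms = 1000.0  # hypothetical baseline, chosen for easy reading

new_ms = latency_after_drop(baseline_ms, DROP)
speedup = serial_speedup(DROP)

print(f"{baseline_ms:.0f} ms -> {new_ms:.0f} ms")
print(f"serial throughput multiplier: {speedup:.2f}x")
```

Under these assumptions, a 28.9% latency reduction corresponds to roughly a 1.41x throughput multiplier for a serial stream, which is why a latency figure below 30% can still matter noticeably for capacity planning.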

Importance: 3/5

Official GitHub release; substantial performance improvements to the most widely used open-source inference engine.

vllm inference open-source gpu deepseek

Sources

official Releases · vllm-project/vllm — GitHub