vLLM v0.20.0 — third release in two weeks
On April 27, vLLM released v0.20.0, its third version in half a month, following v0.18.0 and v0.19.0. The April lineup brought gRPC serving, GPU-accelerated speculative decoding, advanced KV-cache offloading, and full support for Gemma 4 (E2B/E4B/26B MoE/31B Dense, with MoE routing, multimodality, reasoning traces, and tool use); in addition, the async scheduler, which overlaps engine scheduling with GPU execution, is now enabled by default.
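For a sense of how the speculative-decoding feature might be exercised, a minimal sketch using vLLM's offline Python API follows. The model id is a placeholder (Gemma 4 checkpoint names are not confirmed here), and the speculative_config keys mirror vLLM's existing V1 speculative-decoding interface; they are assumptions, not verified against v0.20.0.

```python
# Minimal sketch, assuming vLLM's V1-style speculative_config is unchanged
# in v0.20.0. The model id below is a placeholder, not a verified checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-4-e2b",       # placeholder model id (assumption)
    speculative_config={
        "method": "ngram",            # draft-free n-gram speculation
        "num_speculative_tokens": 5,  # tokens proposed per decoding step
        "prompt_lookup_max": 4,       # largest n-gram window to match
    },
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarize the vLLM v0.20.0 release notes."], params)
for out in outputs:
    print(out.outputs[0].text)
```

With the async scheduler now on by default, no extra flag should be needed to overlap scheduling with GPU execution; the sketch only has to configure speculation.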
Why it matters
The rapid release cadence secures vLLM's niche as a production-ready inference engine for freshly released open models, making it a direct competitor to TensorRT-LLM and SGLang in how quickly new architectures gain support.
Importance: 2/5
A minor inference release in an active release series.
Sources
Official: GitHub Releases (vllm-project/vllm)
Secondary: Fazm, vLLM Update April 2026