vLLM v0.20.0 — third release in two weeks


On April 27, vLLM released v0.20.0, its third version in half a month, following v0.18.0 and v0.19.0. The April lineup brought gRPC serving, GPU-accelerated speculative decoding, advanced KV-cache offloading, full support for Gemma 4 (E2B/E4B/26B MoE/31B Dense, with MoE routing, multimodality, reasoning traces, and tool use), and an async scheduler that overlaps engine scheduling with GPU execution, now enabled by default.
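Alongside the new gRPC path, vLLM's standard HTTP serving speaks the OpenAI-compatible Chat Completions API. A minimal request sketch, assuming a local server on port 8000 and a placeholder model id (both are illustrative, not from the release notes):

```python
import json

# Assumed local vLLM server; the endpoint path follows the
# OpenAI-compatible Chat Completions format that vLLM serves.
BASE_URL = "http://localhost:8000"
ENDPOINT = BASE_URL + "/v1/chat/completions"

# Placeholder model id; substitute whatever model the server was launched with.
payload = {
    "model": "my-model",
    "messages": [
        {"role": "user", "content": "Summarize KV-cache offloading in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.2,
}

# Serialize the request body; to actually send it, POST this JSON to
# ENDPOINT with a Content-Type: application/json header against a
# running server (e.g. via urllib.request or the openai client).
body = json.dumps(payload)
```

Because the API is OpenAI-compatible, existing OpenAI client libraries can be pointed at the vLLM server by overriding their base URL.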

Why it matters

The rapid release cadence positions vLLM in the production-ready inference niche for fresh open models, competing with TensorRT-LLM and SGLang on how quickly new architectures gain support.

Importance: 2/5

A minor inference release in an active series.
