vLLM v0.23.0: Model Runner V2 Default for Llama and Mistral, Transformers v5, Multi-Tier KV Cache

Tools · official · 1 source · ~1 min read

vLLM v0.23.0 (June 15, 408 commits, 200 contributors) makes Model Runner V2 (MRv2) the default for Llama and Mistral dense models. It adds Transformers v5 compatibility, multi-tier KV cache offloading with an object-store secondary tier, a unified reasoning and tool-call parser, and Gemma 4 encoder-free support. The Rust frontend gains streaming generate and dynamic LoRA. The release also includes DeepSeek-V4 production hardening and updates to ROCm 7.2.3 and FlashInfer v0.6.12.
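To make the multi-tier offloading idea concrete, here is a minimal toy sketch of a two-tier cache: a small primary tier that spills evicted blocks to a secondary store instead of dropping them, and promotes them back on access. It is not vLLM's implementation or API. The class name `TwoTierCache`, its methods, and the plain dict standing in for an object store are all hypothetical, chosen only for illustration.

```python
from collections import OrderedDict


class TwoTierCache:
    """Toy two-tier cache (hypothetical, not vLLM code).

    The primary tier is a small LRU map standing in for fast memory.
    The secondary tier is a plain dict standing in for an object store.
    """

    def __init__(self, primary_capacity):
        self.primary_capacity = primary_capacity
        self.primary = OrderedDict()  # least recently used entry comes first
        self.secondary = {}           # assumed stand-in for an object store

    def put(self, block_id, value):
        # Insert or refresh the block as most recently used.
        self.primary[block_id] = value
        self.primary.move_to_end(block_id)
        # Offload the least recently used blocks instead of discarding them.
        while len(self.primary) > self.primary_capacity:
            old_id, old_value = self.primary.popitem(last=False)
            self.secondary[old_id] = old_value

    def get(self, block_id):
        if block_id in self.primary:
            self.primary.move_to_end(block_id)
            return self.primary[block_id]
        if block_id in self.secondary:
            # Promote the block back to the primary tier on access.
            value = self.secondary.pop(block_id)
            self.put(block_id, value)
            return value
        return None  # miss in both tiers


cache = TwoTierCache(primary_capacity=2)
for block_id, value in [("a", 1), ("b", 2), ("c", 3)]:
    cache.put(block_id, value)
```

After these three inserts, block `a` sits in the secondary tier rather than being lost, and a later `cache.get("a")` would pull it back into the primary tier while offloading `b`. A real system would also have to account for transfer latency and block sizes, which this sketch ignores.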

Why it matters

Expanding MRv2 to Llama and Mistral brings it to the two most widely deployed open-weight model families and eliminates pipeline-parallel bubbles. The unified reasoning and tool-call parser simplifies integration for workflows that use both.

Importance: 3/5

Major vLLM release (408 commits) expanding MRv2 to the two most popular open-weight model families

vllm inference open-source deepseek gemma

Sources

official vLLM v0.23.0 release notes