vLLM v0.24.0: Model Runner V2 Default, Rust Frontend, SM90 FP8 Speedups



vLLM v0.24.0 (released around June 30) incorporates 571 commits from 256 contributors. Model Runner V2 is now the default engine for quantized models as well as for dense Llama and Mistral models. The Rust frontend is production-ready, with API-key authentication, CORS, and new tokenization endpoints. SM90 CUTLASS FP8 kernels deliver a 180–290% kernel-level speedup on H100-class hardware. DeepSeek-V4 gains FlashInfer sparse-index caching, and newly supported models include MiniMax-M3 and DiffusionGemma.

Why it matters

Making Model Runner V2 the default for quantized models is a production-readiness milestone. With API-key authentication and CORS built in, the Rust frontend lets vLLM be deployed as a first-class production service without an additional proxy.

Importance: 3/5

Major vLLM release: 571 commits, Model Runner V2 default, Rust frontend production-ready, 180–290% FP8 speedup on H100

Tags: vllm, inference, quantization, deepseek-v4, efficiency, serving

Sources

official vLLM v0.24.0 Release Notes