SGLang v0.5.11: Speculative Decoding V2 as Default and Eight New Model Architectures


SGLang v0.5.11 switches to CUDA 13 + PyTorch 2.11 as its default baseline and enables Speculative Decoding V2 with overlap scheduling by default, reducing per-step CPU overhead. The release adds support for eight new model architectures, including Gemma 4, GLM-5.1, Qwen3.6, and Kimi-K2.6, and extends LoRA support to frontier-scale MLA-based MoE models such as DeepSeek-V3.
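Since Speculative Decoding V2 and overlap scheduling are now on by default, no extra flags should be required to benefit from them. For illustration, a minimal launch sketch using SGLang's existing speculative-decoding CLI flags is shown below; the model and draft-model paths are placeholders, and exact flag names or defaults may differ in v0.5.11.

```shell
# Hypothetical launch sketch. Flag names follow SGLang's established CLI
# conventions (--speculative-algorithm, --speculative-draft-model-path, etc.);
# verify against the v0.5.11 server arguments before use.
python -m sglang.launch_server \
  --model-path <target-model> \
  --speculative-algorithm EAGLE \
  --speculative-draft-model-path <draft-model> \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 4 \
  --speculative-num-draft-tokens 8
```

With V2 as the default scheduler path, the draft and verify steps overlap with CPU-side scheduling, which is where the per-step CPU cost reduction comes from.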

Why it matters

Speculative Decoding V2 as the default changes the throughput baseline for all SGLang deployments; LoRA on DeepSeek-V3/Kimi-K2 unlocks serving fine-tuned variants of the leading open MoE models at production scale.

Importance: 3/5

Major baseline upgrade (CUDA 13 + PyTorch 2.11) plus Speculative Decoding V2 as the default; affects all SGLang inference deployments.
