JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

Hao AI Lab, UC San Diego

Research | official + media | 2 sources | ~1 min read

JetSpec introduces a causal parallel draft head that aligns candidate token-tree scores with the target model's autoregressive factorization, addressing the longstanding tradeoff between autoregressive and bidirectional drafters. It reports speedups of up to 9.64× on MATH-500 and 4.58× on conversational workloads with Qwen3 models on H100/B200 GPUs. The work includes vLLM integration, and the draft models are released on Hugging Face.
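To make "autoregressive factorization" concrete: under a target model, the score of one root-to-node path in a token tree is the sum of the conditional log-probabilities of its tokens, each conditioned on the tokens before it. The sketch below only illustrates that scoring rule on a toy probability table. It is not JetSpec's draft head or training objective, and `TOY_TARGET`, `path_log_score`, and `rank_paths` are hypothetical names with made-up values.

```python
import math

# Hypothetical toy "target model": next-token probabilities keyed by prefix.
# The numbers are invented for illustration only.
TOY_TARGET = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"c": 0.7, "d": 0.3},
    ("b",): {"c": 0.5, "d": 0.5},
}


def path_log_score(path, cond_probs):
    """Return sum_i log p(x_i | x_<i) along one path of a token tree."""
    total = 0.0
    for i, token in enumerate(path):
        prefix = tuple(path[:i])  # tokens that precede position i
        total += math.log(cond_probs[prefix][token])
    return total


def rank_paths(paths, cond_probs):
    """Order candidate paths from most to least likely under the factorization."""
    return sorted(paths, key=lambda p: path_log_score(p, cond_probs), reverse=True)


candidates = [("b", "d"), ("a", "c"), ("a", "d")]
for path in rank_paths(candidates, TOY_TARGET):
    print(path, round(math.exp(path_log_score(path, TOY_TARGET)), 2))
```

A drafter whose tree scores follow this ordering tends to spend its budget on the branches the target model is most likely to accept.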

Why it matters

Speculative decoding had plateaued because larger draft budgets do not reliably yield longer accepted sequences. JetSpec breaks this ceiling with a principled training objective and delivers throughput above 1,000 tokens/second, which is practically significant for reducing inference cost at any scale.
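The budget problem can be seen in a toy example: what matters is whether the draft tree contains the path the target model would actually take, not how many nodes it holds. The sketch below assumes greedy verification (a drafted token is accepted only if it equals the target's top choice), which is a simplification and not necessarily the acceptance rule JetSpec uses. The nested-dict tree format and all names here are hypothetical.

```python
# Hypothetical target continuation, standing in for the target model's greedy output.
TARGET = ["the", "cat", "sat", "on", "mat"]


def target_next(prefix):
    """Token the toy target model would emit after `prefix`."""
    return TARGET[len(prefix)]


def count_nodes(tree):
    """Draft budget: total number of candidate tokens in the tree."""
    return sum(1 + count_nodes(sub) for sub in tree.values())


def accepted_length(tree, next_token):
    """Follow the child matching the target's greedy choice until none matches."""
    prefix = ()
    node = tree
    while node:
        want = next_token(prefix)
        if want not in node:
            break  # the tree does not cover the target's path any further
        prefix += (want,)
        node = node[want]
    return len(prefix)


# Small tree that follows the target's path.
narrow = {"the": {"cat": {"sat": {}}}}
# Larger tree whose extra budget goes to branches the target never takes.
wide = {"the": {"dog": {}, "bird": {}, "fish": {}}, "a": {"cat": {}, "dog": {}}}

for name, tree in [("narrow", narrow), ("wide", wide)]:
    print(name, "budget:", count_nodes(tree), "accepted:", accepted_length(tree, target_next))
```

Here the 3-node tree gets 3 tokens accepted while the 7-node tree gets only 1, which is the kind of scaling failure the summary describes.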

Importance: 3/5

Up to 9.64× speedup from speculative decoding, with a principled training objective that resolves the draft-budget scaling failure

inference speculative-decoding efficiency benchmark paper vllm

Sources

official JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting - arXiv

media Speculative decoding method "JetSpec" developed, speeding up AI by up to 9.64× - Gigazine