JetSpec: Parallel Tree Drafting Achieves up to 9.64× Speculative Decoding Speedup

Hao AI Lab, UCSD


JetSpec introduces a causal parallel draft head that resolves the causality-efficiency dilemma in speculative decoding. Standard tree-based drafters either draft autoregressively (accurate but slow) or in one parallel pass (fast but incoherent). JetSpec trains a draft head over the target model's fused hidden states so candidate-tree token scores follow the target's autoregressive factorization, then verifies the full tree in a single forward pass. On coding and math benchmarks it achieves up to 9.64× speedup over standard autoregressive decoding on H100/B200 GPUs. Code is open-sourced.
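To make "verifies the full tree in a single forward pass" concrete, here is a minimal sketch of generic tree verification as used in tree-based speculative decoding. It is not JetSpec's code: the function names, the parent-array tree encoding, and the greedy acceptance rule are all assumptions for illustration, and the target model's outputs are passed in as plain lists rather than computed by a real model.

```python
from typing import List, Sequence


def tree_attention_mask(parents: Sequence[int]) -> List[List[bool]]:
    """Build an ancestor mask for a flattened candidate tree.

    parents[i] is the index of node i's parent, or -1 if node i hangs
    directly off the already-accepted prefix. mask[i][j] is True when
    node j is node i itself or one of its ancestors, so each candidate
    attends only to its own path (hypothetical encoding, for illustration).
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask


def greedy_accept(
    tokens: Sequence[int],
    parents: Sequence[int],
    root_choice: int,
    node_choice: Sequence[int],
) -> List[int]:
    """Accept the longest tree path that matches the target's greedy picks.

    root_choice is the target's argmax token after the prefix;
    node_choice[i] is its argmax after the path ending at node i.
    In a real system both would come from one masked forward pass.
    """
    accepted: List[int] = []
    current, expected = -1, root_choice
    while True:
        # Find a child of the current node whose token is what the target wants.
        match = next(
            (i for i, p in enumerate(parents)
             if p == current and tokens[i] == expected),
            None,
        )
        if match is None:
            break
        accepted.append(tokens[match])
        current, expected = match, node_choice[match]
    accepted.append(expected)  # the target's own next token is always kept
    return accepted


# Toy tree: two first-level candidates (5, 7); node 0 has children 9 and 3.
tokens = [5, 7, 9, 3]
parents = [-1, -1, 0, 0]
mask = tree_attention_mask(parents)
result = greedy_accept(tokens, parents, root_choice=5, node_choice=[3, 0, 1, 8])
```

In this toy case the path 5 then 3 is accepted and the target contributes one further token, so a single verification step yields three tokens. The mask is what lets all candidate paths share one forward pass without seeing each other.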

Why it matters

Prior speculative decoding methods hit a speedup ceiling as draft budgets grow; JetSpec maintains gains beyond that limit. The reported throughput of 1000+ tokens/second on math tasks makes it immediately relevant for production LLM serving.
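One way to see why a ceiling appears is a common simplification from the speculative decoding literature, not JetSpec's own analysis: assume a single draft chain of length k where each drafted token is accepted independently with probability alpha. The function name and the example values below are assumptions for illustration.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens produced per verification step for a draft chain.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha (0 <= alpha < 1); the target always adds one token.
    This is the geometric sum 1 + alpha + ... + alpha**k.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


# With alpha = 0.8 (an illustrative value), longer drafts saturate
# toward the limit 1 / (1 - alpha) = 5 tokens per step.
short = expected_tokens_per_step(0.8, 4)
long = expected_tokens_per_step(0.8, 64)
```

Under this model, growing the draft budget yields diminishing returns because the expected accepted length is bounded by 1 / (1 - alpha), while drafting cost keeps rising. Raising the effective acceptance rate is the only way to move that bound in this simplified picture.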

Importance: 3/5

Top HF Daily paper June 28 (81 upvotes); 9.64× inference speedup with open code, directly actionable for production stacks

Sources