BlockPilot: Instance-Adaptive Block Size for Diffusion-Based Speculative Decoding


BlockPilot shows that the optimal block size in diffusion-based speculative decoding varies per input and formulates block size selection as a lightweight policy learned from the prefilling representation. Applied to Qwen3-4B, it achieves an acceptance length of 5.92 tokens and a 4.20× inference speedup at temperature T=1, with negligible overhead, and is plug-and-play on top of existing speculative decoding systems.
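
To make the idea of a per-input block size policy concrete, here is a minimal sketch, not the paper's implementation. It assumes the policy is a single linear head over mean-pooled prefill hidden states that picks one size from a fixed candidate set. The candidate sizes, the pooling choice, and the names `mean_pool` and `select_block_size` are all hypothetical and used only for illustration.

```python
from typing import List, Sequence

# Hypothetical candidate block sizes; the actual set used by BlockPilot is not stated here.
CANDIDATE_BLOCK_SIZES = (4, 8, 16)


def mean_pool(hidden_states: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token prefill hidden states into one feature vector."""
    n_tokens = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(tok[d] for tok in hidden_states) / n_tokens for d in range(dim)]


def select_block_size(
    hidden_states: Sequence[Sequence[float]],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
    candidates: Sequence[int] = CANDIDATE_BLOCK_SIZES,
) -> int:
    """Score each candidate with a linear head and return the top-scoring block size."""
    features = mean_pool(hidden_states)
    logits = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    best = max(range(len(candidates)), key=lambda i: logits[i])
    return candidates[best]


# Toy input: 2 prompt tokens with hidden dimension 2 (assumed values).
prefill = [[1.0, 0.0], [3.0, 2.0]]
# One weight row and one bias per candidate block size (assumed values).
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.0]

chosen = select_block_size(prefill, weights, biases)
```

Because the decision is made once from representations the prefill pass already computes, a head this small would add little work per request, which is consistent with the negligible overhead described above.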

Why it matters

The paper received 67 upvotes on Hugging Face Daily Papers (July 1). It demonstrates that a static block size is a meaningful source of inefficiency in speculative decoding and provides a practical, low-overhead fix that reaches a 4.20× inference speedup on Qwen3-4B at T=1.

Importance: 2/5

67 HF Daily Papers upvotes; 4.20× inference speedup via adaptive block sizing; plug-and-play on existing speculative decoding systems.

Sources