BlockPilot: Instance-Adaptive Block Size for Diffusion-Based Speculative Decoding
BlockPilot shows that the optimal block size in diffusion-based speculative decoding varies per input and formulates block size selection as a lightweight policy learned from the prefilling representation. Applied to Qwen3-4B, it achieves an acceptance length of 5.92 tokens and a 4.20× inference speedup at temperature T=1, with negligible overhead, and is plug-and-play on top of existing speculative decoding systems.
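To make the idea of a per-input block size policy concrete, here is a minimal sketch. The summary above only says the policy is lightweight and learned from the prefilling representation; everything else here is an assumption for illustration, not the paper's method: the mean pooling, the single linear layer, the candidate set `CANDIDATE_BLOCK_SIZES`, and the function names are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical candidate block sizes; the real candidate set is not given in the summary.
CANDIDATE_BLOCK_SIZES = [4, 8, 16]


def mean_pool(hidden_states: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token hidden vectors into one prompt-level feature vector."""
    n_tokens = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(tok[d] for tok in hidden_states) / n_tokens for d in range(dim)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_block_size(
    hidden_states: Sequence[Sequence[float]],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
    candidates: Sequence[int] = CANDIDATE_BLOCK_SIZES,
) -> Tuple[int, List[float]]:
    """Pick one block size per input from its prefill hidden states.

    Assumed design: a single linear layer over the mean-pooled prefill
    representation, with one output logit per candidate block size.
    """
    features = mean_pool(hidden_states)
    logits = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(logits)
    best = max(range(len(candidates)), key=lambda i: probs[i])
    return candidates[best], probs


# Toy usage: two prompt tokens with 2-dimensional hidden states and made-up weights.
toy_hidden = [[1.0, 0.0], [3.0, 2.0]]
toy_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
toy_biases = [0.0, 0.0, 0.0]
block_size, probs = select_block_size(toy_hidden, toy_weights, toy_biases)
```

In a sketch like this the policy runs once per request, right after prefill, so its cost is a single small matrix-vector product; the draft and verification steps of the underlying speculative decoding system are untouched.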
Why it matters
67 upvotes on HuggingFace Daily Papers (July 1). The paper demonstrates that a static block size is a meaningful source of inefficiency in speculative decoding and provides a practical, low-overhead fix, with a reported 4.20× speedup on Qwen3-4B.
Importance: 2/5
67 HF Daily Papers upvotes; 4.20× inference speedup via adaptive block sizing, plug-and-play on existing speculative decoding systems