Orthrus: 7.8x Inference Speedup for Qwen3 via Autoregressive-Diffusion KV Sharing
Orthrus (arXiv 2605.12825) pairs a frozen pretrained autoregressive LLM with a lightweight trainable diffusion module that shares the same KV cache, enabling parallel token generation with an exact intra-model consensus mechanism that keeps the output lossless. Applied to Qwen3 (1.7B, 4B, 8B), it achieves up to a 7.8x speedup in tokens per forward pass with O(1) additional memory overhead. The GitHub implementation trended on Hacker News (34 points) and on GitHub's Python trending list May 15–16.
Why it matters
Sharing the KV cache between the autoregressive and diffusion heads is a novel alternative to speculative decoding that avoids the overhead of running a separate draft model. The O(1) additional-memory claim makes the approach feasible on consumer hardware, and Qwen3 compatibility is timely given the model family's widespread adoption.
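The lossless-consensus idea can be illustrated with a toy draft-and-verify loop. This is a minimal sketch, not the paper's implementation: `ar_next_token` and `diffusion_propose` are hypothetical stand-ins for the frozen AR head and the diffusion head (in Orthrus both would attend over the shared KV cache), and the consensus rule shown is the standard one from speculative decoding: accept the longest draft prefix the AR head agrees with, then fall back to the AR token at the first disagreement, so the output matches AR-only decoding exactly.

```python
def ar_next_token(prefix):
    # Hypothetical stand-in for the frozen autoregressive head:
    # the next token is a deterministic function of the prefix.
    return (sum(prefix) + len(prefix)) % 7

def diffusion_propose(prefix, k):
    # Hypothetical stand-in for the diffusion head: proposes k tokens
    # in one parallel pass. Here it agrees with the AR head except on
    # the last draft token, which we deliberately corrupt.
    out, p = [], list(prefix)
    for i in range(k):
        t = ar_next_token(p)
        if i == k - 1:
            t = (t + 1) % 7  # inject a disagreement
        out.append(t)
        p.append(t)
    return out

def consensus_step(prefix, k=4):
    """One draft-verify round: accept the longest prefix of the
    diffusion proposal that the AR head agrees with, then emit the
    AR head's own token at the first disagreement. This makes the
    output identical to AR-only decoding (lossless)."""
    draft = diffusion_propose(prefix, k)
    accepted, p = [], list(prefix)
    for t in draft:
        expect = ar_next_token(p)
        if t == expect:
            accepted.append(t)
            p.append(t)
        else:
            accepted.append(expect)  # AR token replaces the bad draft
            p.append(expect)
            break
    return accepted

out = consensus_step([1, 2, 3], k=4)  # 4 tokens from one verify pass
```

The speedup comes from the verify pass scoring all draft positions in a single forward pass over the shared cache, so each round can emit several tokens instead of one.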
Importance: 3/5
Novel architecture combining AR and diffusion heads with a shared KV cache; 7.8x speedup; trending on Hacker News and GitHub