Orthrus: 7.8x Inference Speedup for Qwen3 via Autoregressive-Diffusion KV Sharing
Orthrus (arXiv 2605.12825) pairs a frozen pretrained autoregressive LLM with a lightweight trainable diffusion module that shares the same KV cache, enabling parallel token generation with an exact intra-model consensus mechanism that keeps the output lossless. Applied to Qwen3 (1.7B, 4B, 8B), it achieves up to a 7.8x speedup in tokens per forward pass with O(1) additional memory overhead. The GitHub implementation trended on Hacker News (34 points) and on GitHub's Python trending list May 15–16.
Why it matters
Sharing the KV cache between the autoregressive and diffusion heads is a novel alternative to speculative decoding that avoids the overhead of running a separate draft model. The O(1) additional-memory claim makes the approach feasible on consumer hardware, and Qwen3 compatibility is timely given the model family's widespread adoption.
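The lossless-consensus idea can be illustrated with a toy draft-and-verify loop. This is a minimal sketch, not the paper's implementation: `ar_next_token` and `diffusion_propose` are hypothetical stand-ins for the frozen AR head and the diffusion head (in Orthrus both would attend over the shared KV cache), and the consensus rule shown is the standard one from speculative decoding: accept the longest draft prefix the AR head agrees with, then fall back to the AR token at the first disagreement, so the output matches AR-only decoding exactly.

```python
def ar_next_token(prefix):
    # Hypothetical stand-in for the frozen autoregressive head:
    # the next token is a deterministic function of the prefix.
    return (sum(prefix) + len(prefix)) % 7

def diffusion_propose(prefix, k):
    # Hypothetical stand-in for the diffusion head: proposes k tokens
    # in one parallel pass. Here it agrees with the AR head except on
    # the last draft token, which we deliberately corrupt.
    out, p = [], list(prefix)
    for i in range(k):
        t = ar_next_token(p)
        if i == k - 1:
            t = (t + 1) % 7  # inject a disagreement
        out.append(t)
        p.append(t)
    return out

def consensus_step(prefix, k=4):
    """One draft-verify round: accept the longest prefix of the
    diffusion proposal that the AR head agrees with, then emit the
    AR head's own token at the first disagreement. This makes the
    output identical to AR-only decoding (lossless)."""
    draft = diffusion_propose(prefix, k)
    accepted, p = [], list(prefix)
    for t in draft:
        expect = ar_next_token(p)
        if t == expect:
            accepted.append(t)
            p.append(t)
        else:
            accepted.append(expect)  # AR token replaces the bad draft
            p.append(expect)
            break
    return accepted

out = consensus_step([1, 2, 3], k=4)  # 4 tokens from one verify pass
```

The speedup comes from the verify pass scoring all draft positions in a single forward pass over the shared cache, so each round can emit several tokens instead of one.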
Importance: 3/5
Novel architecture combining AR and diffusion heads with a shared KV cache; 7.8x speedup; trending on Hacker News and GitHub