Google Releases DiffusionGemma: 26B Open Model with 4× Faster Text Generation
Google DeepMind
Google released DiffusionGemma, an experimental 26B Mixture-of-Experts open model (Apache 2.0) that uses text diffusion instead of autoregressive token generation. Rather than producing one token at a time, it generates and refines a 256-token block in parallel, achieving up to 4× higher throughput: 1,000+ tokens/sec on an H100 and 700+ on a GeForce RTX 5090. Only 3.8B parameters are active during inference, and the quantized model fits within 18 GB of VRAM for consumer GPU deployment. Output quality is lower than that of standard Gemma 4, making it better suited to speed-critical interactive workflows than to quality-first applications.
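To make the contrast concrete, here is a toy sketch of block-wise refinement next to token-by-token decoding. It is not DiffusionGemma's algorithm: `toy_model`, the 8-step schedule, and the "commit the most confident positions" rule are all assumptions chosen for illustration. Only the 256-token block size comes from the summary above.

```python
import random

MASK = None  # placeholder for a position that has not been decided yet


def toy_model(block):
    """Hypothetical stand-in for a network: proposes a (token, confidence) pair for every position at once."""
    rng = random.Random(sum(t is not MASK for t in block))
    return [(f"tok{i}", rng.random()) for i in range(len(block))]


def decode_block(block_size=256, steps=8):
    """Fill a block in a fixed number of parallel passes (assumed schedule, not the real one)."""
    block = [MASK] * block_size
    calls = 0
    for step in range(steps):
        proposals = toy_model(block)  # one pass covers the whole block
        calls += 1
        masked = [i for i, t in enumerate(block) if t is MASK]
        # Commit an equal share of the remaining positions; the last step takes whatever is left.
        quota = -(-len(masked) // (steps - step))  # ceiling division
        best = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:quota]
        for i in best:
            block[i] = proposals[i][0]
    return block, calls


def decode_autoregressive(length=256):
    """Baseline: one model call per generated token."""
    tokens, calls = [], 0
    for i in range(length):
        tokens.append(f"tok{i}")
        calls += 1
    return tokens, calls


if __name__ == "__main__":
    _, block_calls = decode_block()
    _, ar_calls = decode_autoregressive()
    print(block_calls, ar_calls)
```

With these assumed settings, `decode_block()` makes 8 model calls where `decode_autoregressive()` makes 256. The call-count ratio is not a speedup figure, since a pass over a whole block costs more than a single-token step. The up-to-4× number is the one reported in the summary.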
Why it matters
DiffusionGemma is one of the first production-viable open-weights text diffusion models. The architectural shift from sequential to parallel block generation removes memory bandwidth as the primary bottleneck, because each pass over the weights advances a whole block rather than a single token. It also enables bidirectional attention across generated tokens, which is impossible in autoregressive models. An open Apache 2.0 release that runs on consumer hardware should accelerate research into diffusion-based LLMs.
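The attention difference can be shown with two masks. This is a minimal sketch of the general idea, not DiffusionGemma's actual attention layout, which the summary does not describe. The function names are hypothetical.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (only j at or before i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position in the block may attend to every other position."""
    return [[True] * n for _ in range(n)]


def render(mask):
    """Draw a mask as rows of 'x' (allowed) and '.' (blocked)."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


if __name__ == "__main__":
    print(render(causal_mask(4)))
    print()
    print(render(bidirectional_mask(4)))
```

Under the causal mask, position 1 cannot see position 3, so an early token is fixed before later ones exist. Under the bidirectional mask it can, which is what lets a refinement pass revise any position in the block in light of the others.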
Importance: 4/5
Flagship Google open model; among the first production-viable open-weights text diffusion architectures; runs on a consumer GPU. A novel class of text generation model, released openly.