Google Releases DiffusionGemma: 26B Open Model with 4× Faster Text Generation

Google DeepMind

Models / LLM | official + media | 2 sources | ~1 min read

Google released DiffusionGemma, an experimental 26B Mixture-of-Experts open model (Apache 2.0) that generates text by diffusion instead of autoregressive decoding. Rather than producing one token at a time, it generates and refines a 256-token block in parallel, for up to 4× higher throughput: 1,000+ tokens/sec on an H100 and 700+ tokens/sec on a GeForce RTX 5090. Only 3.8B of the 26B parameters are active during inference, and the quantized model fits within 18 GB of VRAM, which puts it within reach of consumer GPUs. Output quality is lower than that of standard Gemma 4, so the model is better suited to speed-critical interactive workflows than to quality-first applications.
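The memory and latency figures above can be sanity-checked with simple arithmetic. The sketch below is a rough estimate, not a measurement: it counts weight storage only, and the bit widths it tries are assumptions, since the source does not say which quantization the 18 GB figure refers to.

```python
# Back-of-envelope check of the figures quoted in the summary.
# Assumption: the bit widths tried below are illustrative; the source does
# not state which quantization the 18 GB figure refers to.

TOTAL_PARAMS = 26e9      # total parameters (from the summary)
ACTIVE_PARAMS = 3.8e9    # parameters active during inference (from the summary)
VRAM_BUDGET_GB = 18      # quantized footprint quoted in the summary
BLOCK_TOKENS = 256       # tokens generated per block (from the summary)


def weight_gb(params: float, bits_per_param: float) -> float:
    """Storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


def block_latency_ms(tokens_per_sec: float, block_tokens: int = BLOCK_TOKENS) -> float:
    """Average time to finish one block at a given sustained throughput."""
    return block_tokens / tokens_per_sec * 1000


if __name__ == "__main__":
    for bits in (16, 8, 4):
        gb = weight_gb(TOTAL_PARAMS, bits)
        fits = "fits" if gb <= VRAM_BUDGET_GB else "does not fit"
        print(f"{bits}-bit weights: {gb:.1f} GB ({fits} in {VRAM_BUDGET_GB} GB)")
    print(f"active fraction: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
    print(f"H100 at 1,000 tok/s: {block_latency_ms(1000):.0f} ms per block")
    print(f"RTX 5090 at 700 tok/s: {block_latency_ms(700):.0f} ms per block")
```

Under these assumptions, 16-bit and 8-bit weights (52 GB and 26 GB) exceed the budget, while roughly 4-bit weights (13 GB) fit and leave about 5 GB for activations and cache. All expert weights typically have to stay resident in a Mixture-of-Experts model, which is why the footprint tracks the 26B total rather than the 3.8B active parameters.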

Why it matters

One of the first production-viable open-weights text diffusion models. Autoregressive decoding re-reads the model weights for every generated token, so it is typically limited by memory bandwidth; generating a block in parallel spreads each weight read across many tokens and eases that bottleneck. Block generation also lets generated tokens attend to each other in both directions, which the causal masking of autoregressive models rules out. An Apache 2.0 release that runs on consumer hardware should accelerate research into diffusion-based LLMs.
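The difference between causal and bidirectional attention can be made concrete with a toy attention mask. This is a minimal illustration of the general idea, not DiffusionGemma's actual masking scheme, which the source does not describe; the function names and the choice to keep the prompt causal are assumptions made for the example.

```python
# Toy comparison of attention masks. mask[i][j] is True when position i
# may attend to position j. Illustrative only: the real model's masking
# scheme is not described in the source.


def causal_mask(n: int) -> list[list[bool]]:
    """Autoregressive masking: each token sees itself and earlier tokens."""
    return [[j <= i for j in range(n)] for i in range(n)]


def block_bidirectional_mask(prompt_len: int, block_len: int) -> list[list[bool]]:
    """Prompt stays causal (an assumption); tokens in the generated block
    see the prompt and every token in the block, including later ones."""
    n = prompt_len + block_len
    mask = causal_mask(n)
    for i in range(prompt_len, n):
        for j in range(prompt_len, n):
            mask[i][j] = True
    return mask


def show(mask: list[list[bool]]) -> str:
    """Render a mask as rows of 'x' (allowed) and '.' (blocked)."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


if __name__ == "__main__":
    print(show(causal_mask(5)))
    print()
    print(show(block_bidirectional_mask(prompt_len=2, block_len=3)))
```

In the causal mask every entry above the diagonal is blocked. In the second mask the block of three generated positions is fully connected, so an early token in the block can be revised in light of a later one during refinement.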

Importance: 4/5

Major Google open-model release; among the first production-viable open-weights text diffusion architectures; runs on a consumer GPU. A novel class of text generation model, released openly.

gemma diffusion-gemma open-weights text-diffusion local-inference apache2

Sources

official DiffusionGemma: 4× faster text generation — Google Blog

media Google open-sources speedy DiffusionGemma text diffusion model — SiliconAngle