How Transparent is DiffusionGemma? Interpretability Study Closes the Gap to Autoregressive Models
Google DeepMind
This paper investigates whether DiffusionGemma, a masked discrete-diffusion LM that reasons in continuous latent space, is harder to interpret than autoregressive models. By mapping intermediate denoising states through an interpretable token bottleneck, the authors reduce the apparent transparency gap from 28.6× to 1.1× relative to Gemma 4 and identify diffusion-specific phenomena such as non-chronological reasoning and token smearing. Co-authored by Neel Nanda and Rohin Shah.
Why it matters
First systematic mechanistic interpretability (mech-interp) study of a production-scale diffusion language model, with direct implications for AI safety monitoring as diffusion LMs gain adoption.
Importance: 3/5
First mech-interp study of a production diffusion LM; authored by Neel Nanda and Rohin Shah; closes a critical gap in monitorability research for diffusion-based inference.