Google DeepMind Releases Gemma 4 QAT Checkpoints: Sub-1 GB On-Device E2B Model

Google DeepMind


Google DeepMind released Quantization-Aware Training (QAT) checkpoints for the full Gemma 4 family on June 5. A new mobile QAT format cuts the E2B (2B) model's memory footprint from 9.6 GB in BF16 to under 1 GB of RAM. The Q4_0 QAT checkpoints reduce E2B from 9.6 GB to 3.2 GB and E4B from 15 GB to 5 GB. Weights are available on Hugging Face, with immediate support in llama.cpp (b9549+ adds Gemma 4 MTP support), Ollama, LM Studio, vLLM, MLX, and LiteRT-LM.
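To put the quoted memory figures in perspective, the short sketch below computes the compression ratio and percentage saved for each model. The dictionary keys and the `reduction` helper are illustrative names, not part of any Gemma tooling, and the mobile-format figure is only stated as "under 1 GB", so its ratio is treated as a lower bound.

```python
# Memory figures in GB, as quoted in the summary above.
# Key names are illustrative labels, not official format identifiers.
FOOTPRINTS_GB = {
    "E2B": {"bf16": 9.6, "q4_0_qat": 3.2},
    "E4B": {"bf16": 15.0, "q4_0_qat": 5.0},
}


def reduction(before_gb: float, after_gb: float) -> tuple[float, float]:
    """Return (compression ratio, percent of memory saved)."""
    ratio = before_gb / after_gb
    saved_pct = (1 - after_gb / before_gb) * 100
    return ratio, saved_pct


for model, sizes in FOOTPRINTS_GB.items():
    ratio, saved = reduction(sizes["bf16"], sizes["q4_0_qat"])
    print(f"{model}: {ratio:.1f}x smaller, {saved:.0f}% saved")

# The mobile QAT format is only bounded ("under 1 GB"), so plugging in
# exactly 1.0 GB gives a lower bound on the E2B compression ratio.
mobile_ratio_lower_bound, _ = reduction(9.6, 1.0)
print(f"E2B mobile format: at least {mobile_ratio_lower_bound:.1f}x smaller")
```

On these numbers, Q4_0 QAT is a 3x reduction for both models, while the mobile format shrinks E2B by at least 9.6x relative to BF16.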

Why it matters

Models that fit in under 1 GB of RAM can be deployed on mid-range phones and microcontrollers. QAT softens the quality cliff that aggressive quantization typically causes, making compact Gemma 4 models viable for production on-device applications. That makes this release a milestone for edge AI.

Importance: 3/5

Confirmed by the official Google DeepMind blog and two independent media reports; first open-weights multimodal family to reach the sub-1 GB on-device threshold while retaining quality.

Sources