Google releases Gemma 4 MTP drafters to speed up local LLM inference
Google introduced Gemma 4 MTP drafters aimed at raising throughput and cutting latency for on-device and local LLM inference.
Google says it’s pushing the Gemma 4 open model family further on performance by shipping **Multi‑Token Prediction (MTP) drafters**, speculative-decoding companion models designed to reduce end‑user latency.
## What’s new
- **MTP drafters for Gemma 4**: lightweight draft models that propose multiple future tokens.
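- **Speculative decoding workflow**: a larger “target” model verifies the proposed tokens in parallel; if they match its own predictions, it accepts the whole sequence in a single forward pass. In standard speculative decoding, a mismatch means only the matching prefix is kept and the target's own token is used from that point on.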
- **Claimed result**: Google reports **tokens‑per‑second speedups of up to roughly 3×** across several stacks (LiteRT‑LM, MLX, Hugging Face Transformers, vLLM), with output unchanged because the primary model verifies every drafted token.
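
The draft-and-verify loop described above can be illustrated with a toy sketch. This is the textbook greedy variant of speculative decoding, not Google's implementation: the names `speculative_step`, `generate`, and `greedy_decode` are hypothetical, the "models" are plain Python functions that return the next token, and the verification loop is sequential where a real runtime would score all draft positions in one batched forward pass.

```python
from typing import Callable, List

# Assumption: a "model" is any function mapping a token sequence to its
# greedy next token. Real drafters and targets are neural networks.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, drafter: NextToken,
                     context: List[int], k: int = 4) -> List[int]:
    """Return the tokens accepted in one draft-and-verify round."""
    # 1. The drafter proposes k tokens, one after another.
    draft: List[int] = []
    for _ in range(k):
        draft.append(drafter(context + draft))

    # 2. The target checks each drafted position in order.
    accepted: List[int] = []
    for tok in draft:
        expected = target(context + accepted)
        if expected != tok:
            # First mismatch: keep the matching prefix, then use the
            # target's own token and discard the rest of the draft.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft matched, so the target adds one more token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, drafter: NextToken, prompt: List[int],
             max_new_tokens: int, k: int = 4) -> List[int]:
    """Decode with repeated speculative rounds, trimmed to the budget."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, drafter, out, k))
    return out[:len(prompt) + max_new_tokens]


def greedy_decode(model: NextToken, prompt: List[int],
                  max_new_tokens: int) -> List[int]:
    """Plain one-token-at-a-time decoding, for comparison."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(model(out))
    return out


if __name__ == "__main__":
    # Toy models: the target counts upward mod 10; the drafter agrees
    # except right after a 5, where it guesses wrong.
    target = lambda seq: (seq[-1] + 1) % 10
    drafter = lambda seq: 0 if seq[-1] == 5 else (seq[-1] + 1) % 10
    print(generate(target, drafter, [0], 12) == greedy_decode(target, [0], 12))
```

Because every emitted token is either confirmed or produced by the target, the sequence matches what the target would generate alone. The speed gain depends on how often the drafter's guesses are accepted.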
## Why it matters for developers
Inference speed is frequently the main constraint for:
- **Agentic apps** that need quick multi-step planning loops
- **Voice and chat UX** where pauses break the experience
- **On‑device AI** where compute and battery are limited
## How to try it
Google says the MTP drafters are released under the same **Apache 2.0** license as Gemma 4 and can be used across common runtimes (Transformers, MLX, vLLM, SGLang, Ollama), with weights available via Hugging Face and Kaggle.
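
Whichever runtime is used, a throughput claim can be checked with a small timing harness. The sketch below is runtime-agnostic and deliberately avoids any real library API: `tokens_per_second` and `speedup` are hypothetical helpers, and `generate_fn` stands for whatever wrapper you write around your runtime's generation call, assumed to take a prompt and a token budget and return the prompt plus the new tokens.

```python
import time
from typing import Callable, List

# Assumption: generate_fn(prompt_tokens, max_new_tokens) returns the
# prompt followed by the newly generated tokens.
GenerateFn = Callable[[List[int], int], List[int]]


def tokens_per_second(generate_fn: GenerateFn, prompt: List[int],
                      max_new_tokens: int, runs: int = 3) -> float:
    """Best-of-`runs` decode throughput in new tokens per second."""
    best = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        output = generate_fn(prompt, max_new_tokens)
        # Guard against a zero interval on very fast stand-in functions.
        elapsed = max(time.perf_counter() - start, 1e-9)
        new_tokens = len(output) - len(prompt)
        best = max(best, new_tokens / elapsed)
    return best


def speedup(baseline_fn: GenerateFn, drafted_fn: GenerateFn,
            prompt: List[int], max_new_tokens: int) -> float:
    """Throughput ratio of the drafted path to the baseline path."""
    base = tokens_per_second(baseline_fn, prompt, max_new_tokens)
    fast = tokens_per_second(drafted_fn, prompt, max_new_tokens)
    return fast / base
```

Run both paths with the same prompt, token budget, and batch size, and compare the generated tokens as well as the timing, since the reported benefit assumes the output is unchanged.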
*Source: Google Blog. This post summarizes claims and technical description from the original article; developers should benchmark on their own target hardware and batch sizes.*