Google releases Gemma 4 MTP drafters to speed up local LLM inference

Google introduced Gemma 4 MTP drafters aimed at raising throughput and cutting latency for on-device and local LLM inference.

Google says it is pushing the Gemma 4 open model family further on performance by shipping **Multi‑Token Prediction (MTP) drafters**: companion models for speculative decoding, designed to reduce end‑user latency.

## What’s new

- **MTP drafters for Gemma 4**: lightweight draft models that propose multiple future tokens.

- **Speculative decoding workflow**: a larger “target” model verifies the proposed tokens in parallel; if it agrees with the draft, it accepts the whole sequence in a single forward pass, and otherwise it keeps only the tokens up to the first disagreement.

- **Claimed result**: Google reports **up to ~3× tokens‑per‑second speedups** across several stacks (LiteRT‑LM, MLX, Hugging Face Transformers, vLLM), with output kept identical because every token is verified by the target model.
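The draft-then-verify loop described above can be sketched in a few lines of Python. This is a toy illustration of generic greedy speculative decoding, not Google's implementation: `target_next` and `draft_next` are hypothetical stand-ins for the target model and the drafter, and the "bonus" token emitted when every draft token is accepted is standard speculative-decoding behavior rather than something stated in this post. A real runtime scores all draft positions in one batched forward pass; the sketch loops over them for readability and counts each round as one pass.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy "model": context -> next token


def target_next(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the large target model (deterministic, greedy)."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the drafter: deliberately wrong when len(ctx) % 4 == 0."""
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


def greedy_decode(model: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[Token], n_new: int, k: int = 4
) -> Tuple[List[Token], int]:
    """Return (tokens, number of target verification rounds)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(out) < goal:
        # 1) The drafter proposes k tokens autoregressively (cheap).
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2) The target checks each proposed position (one batched pass in practice).
        target_passes += 1
        accepted: List[Token] = []
        for i in range(k):
            expected = target(out + proposal[:i])
            if proposal[i] == expected:
                accepted.append(proposal[i])
            else:
                # First disagreement: keep the target's token, discard the rest.
                accepted.append(expected)
                break
        else:
            # Every draft token matched: the target contributes one extra token.
            accepted.append(target(out + proposal))
        out.extend(accepted)
    return out[:goal], target_passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    baseline = greedy_decode(target_next, prompt, 20)
    tokens, passes = speculative_decode(target_next, draft_next, prompt, 20)
    print("identical output:", tokens == baseline)
    print("target rounds:", passes, "vs 20 for plain greedy decoding")
```

Because every emitted token is either confirmed or supplied by the target, the result matches plain greedy decoding of the target alone; the gain comes from needing fewer target rounds than generated tokens.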

## Why it matters for developers

Inference speed is frequently the main constraint for:

- **Agentic apps** that need quick multi-step planning loops

- **Voice and chat UX** where pauses break the experience

- **On‑device AI** where compute and battery are limited

## How to try it

Google says the MTP drafters are released under the same **Apache 2.0** license as Gemma 4 and can be used across common runtimes (Transformers, MLX, vLLM, SGLang, Ollama), with weights available via Hugging Face and Kaggle.
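To check a speedup claim on your own hardware, a small timing harness is usually enough. The sketch below is runtime-agnostic and makes no assumptions about any particular library's API: `tokens_per_second` and `speedup` are hypothetical helper names, and `generate` is any callable you write yourself that runs your chosen runtime and returns the newly generated tokens.

```python
import time
from typing import Callable, List


def tokens_per_second(
    generate: Callable[[int], List[int]], n_new: int, repeats: int = 3
) -> float:
    """Best-of-`repeats` throughput for a callable returning the new tokens."""
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        produced = generate(n_new)
        elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
        best = max(best, len(produced) / elapsed)
    return best


def speedup(
    baseline: Callable[[int], List[int]],
    candidate: Callable[[int], List[int]],
    n_new: int = 128,
) -> float:
    """Ratio of candidate throughput to baseline throughput (> 1 means faster)."""
    return tokens_per_second(candidate, n_new) / tokens_per_second(baseline, n_new)
```

Wrap the target-only configuration as `baseline` and the drafter-assisted one as `candidate`, use the same prompt and generation length for both, and repeat the measurement at each batch size you care about.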

*Source: Google Blog. This post summarizes claims and technical description from the original article; developers should benchmark on their own target hardware and batch sizes.*
