Google recently released Multi-Token Prediction drafters for Gemma 4. Using a technique called speculative decoding, they've achieved a 3x speedup in inference with zero loss in quality. The change targets a core bottleneck in large language model serving: during decoding, the model produces output one token at a time, and generation speed suffers accordingly.
Inference has two phases. The prefill phase reads the prompt in parallel and is compute-bound, while the decode phase generates one token at a time and is memory-bandwidth bound. For a 31B parameter model at 16-bit precision, the GPU has to load ~62GB of weights from memory just to generate a single token. Much of the delay comes from the GPU waiting on memory rather than doing useful computation, which is what makes model responses feel sluggish.
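To make the bandwidth bound concrete, here is a back-of-the-envelope sketch in Python. The bandwidth figure is an assumption (roughly H100-class hardware), not something stated in the release; only the 31B/62GB numbers come from the text above.

```python
# Back-of-the-envelope decode ceiling for a memory-bandwidth-bound model.
# Assumed numbers: 31B parameters in bf16 (2 bytes each) and ~3.35 TB/s of
# HBM bandwidth (an H100-class figure); both are illustrative, not official.

params = 31e9                 # model parameters
bytes_per_param = 2           # 16-bit (bf16) weights
hbm_bandwidth = 3.35e12       # bytes/s, assumed GPU memory bandwidth

weight_bytes = params * bytes_per_param          # ~62 GB read per decode step
max_tokens_per_s = hbm_bandwidth / weight_bytes  # upper bound, ignores KV cache

print(f"weights: {weight_bytes / 1e9:.0f} GB")
print(f"theoretical decode ceiling: {max_tokens_per_s:.0f} tokens/s")
```

Under these assumptions the ceiling is around 54 tokens per second, no matter how fast the GPU's compute units are; the weights simply cannot be streamed from memory any faster.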
Large language models avoid re-reading the entire prompt on every generated token by using a KV cache, which stores the attention keys and values computed for previous tokens in VRAM. This acts as short-term memory: the model can reference prior context without recomputing the full conversation history. The cache still grows linearly with conversation length. In Gemma 4's architecture, the smaller drafter shares this KV cache with the main model, avoiding the extra time and memory it would otherwise spend rebuilding context.
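That linear growth is easy to quantify. The sketch below estimates per-token cache size; the layer count, KV-head count, and head dimension are hypothetical placeholders, since Gemma 4's actual configuration isn't given here.

```python
# Rough KV-cache sizing: each generated token stores one key and one value
# vector per layer. All architecture numbers below are illustrative
# placeholders, NOT Gemma 4's real configuration.

n_layers = 48          # hypothetical transformer depth
n_kv_heads = 8         # hypothetical KV heads (grouped-query attention)
head_dim = 128         # hypothetical per-head dimension
bytes_per_value = 2    # bf16 cache entries

# Factor of 2 = one key vector + one value vector per layer.
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
context_len = 32_000

print(f"per token: {kv_bytes_per_token / 1024:.0f} KiB")
print(f"at {context_len} tokens: {kv_bytes_per_token * context_len / 1e9:.2f} GB")
```

With these placeholder numbers, a 32K-token conversation occupies several gigabytes of VRAM on top of the weights, which is why sharing one cache between drafter and main model matters.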
Speculative decoding works by pairing the main Gemma 4 model with a much smaller drafter. The drafter quickly predicts the next 4-5 tokens, then the larger model verifies those guesses in a single parallel pass. Because decode is memory-bound, that verification pass reads the weights from memory once no matter how many positions it checks, so verifying 5 tokens takes almost the same time as generating one token from scratch. When the guesses are correct, the result is a substantial speedup; when they are wrong, the bad guesses are discarded and the system is no slower than standard decoding.
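A minimal sketch of the draft-and-verify loop, using greedy acceptance for simplicity. The `draft_model` and `target_model` callables are hypothetical stand-ins, not a real API, and production systems use a rejection-sampling rule rather than plain argmax matching.

```python
# Minimal greedy speculative-decoding step. `draft_model(tokens)` and
# `target_model(tokens, proposed)` are hypothetical callables returning
# next-token logits; they stand in for the small drafter and main model.
import numpy as np

def speculative_step(target_model, draft_model, tokens, k=5):
    # 1. Drafter proposes k tokens autoregressively (cheap, small model).
    draft = list(tokens)
    proposed = []
    for _ in range(k):
        nxt = int(np.argmax(draft_model(draft)))
        proposed.append(nxt)
        draft.append(nxt)

    # 2. Target model scores all k positions in ONE parallel forward pass:
    #    the weights are streamed from HBM once regardless of k.
    logits = target_model(tokens, proposed)   # shape: (k, vocab)

    # 3. Accept the longest matching prefix. The first mismatch is replaced
    #    by the target model's own choice, so every step yields at least
    #    one token the target model itself would have produced.
    accepted = []
    for i, tok in enumerate(proposed):
        target_choice = int(np.argmax(logits[i]))
        if tok == target_choice:
            accepted.append(tok)
        else:
            accepted.append(target_choice)
            break
    return tokens + accepted
```

Because every accepted token matches what the target model would have emitted on its own, the output is identical to running the large model alone, which is where the "zero loss in quality" claim comes from.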
For system design discussions around large language model latency, the recommended path starts with simpler optimizations such as using a smaller quantized model, reducing output tokens, or cutting agent steps. Product design should also emphasize Time to First Token when streaming is enabled. Even if the total response takes 5 seconds, showing the first word in 300ms makes it feel instant. A more advanced explanation centers on arithmetic intensity: speculative decoding improves the ratio of computation to memory access by giving the GPU a batch of tokens to verify at once.
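The arithmetic-intensity point can be shown in two lines of math: verifying k tokens reuses the same weight read, so FLOPs per byte scale with k. The parameter count below reuses the 31B figure from earlier; the rest is illustrative.

```python
# Arithmetic intensity = FLOPs performed per byte moved from memory.
# Decoding one token costs ~2 FLOPs per parameter while reading every
# weight once; verifying k drafted tokens does k times the FLOPs for
# roughly the same weight traffic. Numbers are illustrative.

params = 31e9
weight_bytes = params * 2                  # bf16 weights

def intensity(tokens_per_pass):
    flops = 2 * params * tokens_per_pass   # matmul FLOPs, ignoring attention
    return flops / weight_bytes            # FLOPs per byte moved

print(f"1 token/pass:  {intensity(1):.1f} FLOPs/byte")
print(f"5 tokens/pass: {intensity(5):.1f} FLOPs/byte")
```

Five tokens per pass means five times the useful work per byte of memory traffic, which is exactly the shift from a memory-starved GPU toward a compute-busy one.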
