How Google made Gemma faster with speculative decoding

Google introduced Multi-Token Prediction drafters for Gemma 4 to accelerate inference through speculative decoding. The approach speeds token generation by pairing the main model with a smaller drafter that shares its context and proposes several tokens at a time, which the main model then verifies in a single parallel pass.

Google recently released Multi-Token Prediction drafters for Gemma 4. By using a technique called speculative decoding, they’ve achieved a 3x speedup in inference with zero loss in quality. The change targets a core bottleneck in large language model serving, where generation slows because the model produces output one token at a time during decoding.

Inference has two phases. The prefill phase reads the prompt in parallel and is compute-bound, while the decode phase emits one token at a time and is memory-bandwidth bound. For a 31B-parameter model with 16-bit weights, the GPU must stream roughly 62GB of data from memory just to generate a single token. Much of the delay comes from the GPU waiting on memory rather than doing useful computation, which is what makes model responses feel sluggish.
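
To make that bottleneck concrete, here is a rough back-of-the-envelope sketch in Python. The 31B parameter count comes from the article; the 16-bit weight size and the memory-bandwidth figure are illustrative assumptions, not measured values.

```python
# Back-of-the-envelope decode speed for a memory-bandwidth-bound model.
# The parameter count matches the article's 31B example; the bandwidth
# figure below is an illustrative assumption, not a measurement.

PARAMS = 31e9            # model parameters
BYTES_PER_PARAM = 2      # bf16/fp16 weights
HBM_BANDWIDTH = 2.0e12   # bytes/s, assumed ~2 TB/s of HBM bandwidth

bytes_per_token = PARAMS * BYTES_PER_PARAM            # ~62 GB streamed per decoded token
max_tokens_per_s = HBM_BANDWIDTH / bytes_per_token    # upper bound from bandwidth alone

print(f"Weights streamed per token: {bytes_per_token / 1e9:.0f} GB")
print(f"Bandwidth-limited decode rate: {max_tokens_per_s:.1f} tokens/s")
```

Even with multi-terabyte-per-second memory, the weights alone cap sequential decoding at a few tens of tokens per second, regardless of how much compute sits idle.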

Large language models avoid re-reading the entire prompt on every generated token by using a KV cache, which stores intermediate states from previous tokens in VRAM. This acts as a form of short-term memory, allowing the model to reference prior context without recomputing the full conversation history. The cache still grows linearly with conversation length. In Gemma 4’s architecture, the smaller drafter shares this KV cache with the main model, avoiding extra time and memory spent rebuilding context.
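
As a rough illustration of why the cache grows linearly, the sketch below estimates KV-cache size per token from a model's layer count, KV-head count, and head dimension. The specific values are placeholders for illustration, not Gemma 4's published configuration.

```python
# Rough KV-cache size per token: every layer stores one key and one value
# vector per KV head. The layer/head/dimension values are placeholders,
# not Gemma 4's actual architecture.

NUM_LAYERS = 48
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2   # fp16/bf16 cache

kv_bytes_per_token = NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * 2 * BYTES_PER_VALUE  # x2 for K and V

for context_len in (1_000, 32_000, 128_000):
    total_gb = kv_bytes_per_token * context_len / 1e9
    print(f"{context_len:>7} tokens -> {total_gb:5.2f} GB of KV cache")
```

Because the drafter reuses the cache the main model has already built, it pays none of this cost a second time.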

Speculative decoding works by pairing the main Gemma 4 model with a much smaller drafter. The drafter quickly predicts the next 4-5 tokens, then the larger model verifies those guesses in a single parallel pass. Because that verification step is parallel, checking 5 tokens takes almost the same time as generating 1 token from scratch. When the guesses are correct, the result is a substantial speedup; when they are wrong, the guesses are discarded without making the system slower than standard decoding.
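
The draft-then-verify loop can be sketched in a few lines. The version below is a minimal greedy variant: `draft_next` and `target_logits` are stand-ins for the drafter and the main model, and the acceptance rule simply keeps the longest prefix on which the two models agree. It is an outline of the idea under those assumptions, not Google's actual implementation.

```python
# Minimal greedy speculative decoding loop (assumes a non-empty prompt).
from typing import Callable, List

def speculative_decode(
    prompt: List[int],
    draft_next: Callable[[List[int]], int],                    # drafter: tokens -> next token id
    target_logits: Callable[[List[int]], List[List[float]]],   # main model: tokens -> per-position logits
    max_new_tokens: int = 64,
    lookahead: int = 5,
) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The drafter cheaply guesses the next few tokens, one at a time.
        draft = []
        for _ in range(lookahead):
            draft.append(draft_next(tokens + draft))

        # 2. The main model scores the whole guessed block in one parallel pass.
        logits = target_logits(tokens + draft)

        # 3. Keep the longest prefix the main model agrees with; on the first
        #    mismatch, keep the main model's corrected token and drop the rest.
        accepted = []
        for i, guess in enumerate(draft):
            pos = len(tokens) + i - 1            # logits at `pos` predict position pos + 1
            scores = logits[pos]
            target_choice = max(range(len(scores)), key=scores.__getitem__)
            accepted.append(target_choice)
            if target_choice != guess:
                break

        tokens.extend(accepted)
    return tokens
```

In the best case all drafted tokens are accepted for the cost of one main-model pass; in the worst case the loop still advances by one verified token per pass, which is why the method does not fall behind standard decoding.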

For system design discussions around large language model latency, the recommended path starts with simpler optimizations such as using a smaller quantized model, reducing output tokens, or cutting agent steps. Product design should also emphasize Time to First Token when streaming is enabled. Even if the total response takes 5 seconds, showing the first word in 300ms makes it feel instant. A more advanced explanation centers on arithmetic intensity: speculative decoding improves the ratio of computation to memory access by giving the GPU a batch of tokens to verify at once.
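
The arithmetic-intensity argument can be put in numbers. Assuming roughly 2 FLOPs per parameter per generated token and 16-bit weights (standard rules of thumb, not figures from the article), verifying k tokens per pass multiplies the FLOPs-per-byte ratio by roughly k, because the weights are streamed from memory only once per pass.

```python
# Arithmetic intensity (FLOPs per byte read) when the GPU verifies k tokens
# in one pass instead of decoding them one at a time. Numbers are
# illustrative rules of thumb, not Gemma 4 measurements.

PARAMS = 31e9
BYTES_PER_PARAM = 2
FLOPS_PER_TOKEN = 2 * PARAMS          # ~2 FLOPs per parameter per token

for k in (1, 5):
    flops = k * FLOPS_PER_TOKEN
    bytes_read = PARAMS * BYTES_PER_PARAM   # weights streamed once per forward pass
    print(f"k={k}: arithmetic intensity ~ {flops / bytes_read:.1f} FLOPs/byte")
```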

Impact Score: 58

Artificial Intelligence reshapes the UK entry-level jobs market

The spread of Artificial Intelligence is reducing demand for some junior roles while increasing pressure on employers to build digital skills. Business groups warn that rising costs and automation could deepen youth unemployment and skills shortages across the United Kingdom.

Apple explores Intel and Samsung for chip supply

Apple is weighing Intel and Samsung as potential suppliers for the main processors in its devices as it looks to reduce geopolitical and manufacturing risk tied to Taiwan. The move would extend a broader effort to diversify its supply chain amid tariffs, friend-shoring, and heavy Artificial Intelligence-driven chip demand.

European firms struggle to track Artificial Intelligence cyberattacks

European organisations are adopting Artificial Intelligence widely, but many lack the visibility and governance needed to understand whether they have already been targeted by Artificial Intelligence-powered attacks. ISACA’s latest survey points to rising concern over misinformation, privacy, weak policy controls, and a growing skills gap.

Artificial Intelligence reshapes NRO space operations

The National Reconnaissance Office is expanding Artificial Intelligence across satellite and ground systems to speed delivery, improve accuracy, and extend human capabilities. The agency is pairing that push with testing, validation, and workforce development aimed at building trust in mission-critical systems.
