On Tuesday, June 10, Google quietly released DiffusionGemma, a 26B mixture-of-experts Gemma model that does not predict the next token at all. It starts with a canvas of 256 random placeholder tokens, literal noise, and denoises the whole block into readable text over a handful of passes, finalizing 15-20 tokens per forward pass. The approach makes DiffusionGemma closer in process to image diffusion systems than to standard left-to-right language models.
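The block-denoising loop described above can be sketched as a toy. This is not Google's implementation: `toy_denoiser`, the fixed `TOKENS_PER_PASS` of 16 (picked from inside the reported 15-20 range), and the confidence-ranked commit rule are all assumptions made for illustration, and the "model" here simply knows the target text.

```python
import random

BLOCK_SIZE = 256       # canvas length described in the article
TOKENS_PER_PASS = 16   # assumed fixed value inside the reported 15-20 range

def toy_denoiser(finalized, target, rng):
    """Hypothetical stand-in for the model: a (token, confidence) guess per open position."""
    return {i: (target[i], rng.random())
            for i in range(len(target)) if i not in finalized}

def denoise_block(target, vocab, seed=0):
    rng = random.Random(seed)
    canvas = [rng.choice(vocab) for _ in target]  # start from random tokens (noise)
    finalized = set()
    passes = 0
    while len(finalized) < len(target):
        guesses = toy_denoiser(finalized, target, rng)
        # Commit only the most confident guesses from this pass.
        ranked = sorted(guesses, key=lambda i: guesses[i][1], reverse=True)
        for i in ranked[:TOKENS_PER_PASS]:
            canvas[i] = guesses[i][0]
            finalized.add(i)
        passes += 1
    return canvas, passes

target = [f"word{i}" for i in range(BLOCK_SIZE)]
vocab = ["noise_a", "noise_b", "noise_c"]
canvas, passes = denoise_block(target, vocab)
print(passes, canvas == target)
```

Under these assumed numbers the 256-position block settles in 16 passes, where a token-at-a-time loop would need 256 steps.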
On a single H100 it streams 1,000+ tokens a second; Google’s own chart shows 1,107 tok/s against 303 tok/s for the same-size autoregressive Gemma 4. That is a roughly 3.7x throughput jump on the same 26B-A4B backbone and the same hardware, and the model is available to download today under an Apache 2.0 license. The reported gain comes from working across a whole block of text at once rather than committing to one token at a time.
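The speedup figure follows directly from the two chart numbers quoted above. A quick check, with variable names chosen here for illustration:

```python
diffusion_tok_s = 1107       # DiffusionGemma, from Google's chart
autoregressive_tok_s = 303   # same-size autoregressive Gemma 4, same chart

speedup = diffusion_tok_s / autoregressive_tok_s
print(f"{speedup:.2f}x")  # 3.65x, which rounds to the quoted 3.7x

# Time to stream 1,000 tokens at each rate, in seconds.
seconds_diffusion = 1000 / diffusion_tok_s
seconds_autoregressive = 1000 / autoregressive_tok_s
print(f"{seconds_diffusion:.2f}s vs {seconds_autoregressive:.2f}s")
```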
Every LLM you have ever used (GPT-5.5, Claude, Gemini, Llama, DeepSeek) writes the way a typewriter does: one token, then the next, each conditioned on everything before it. DiffusionGemma writes the way Stable Diffusion paints. The model can un-commit a token it got wrong, attend to words it has not finished writing yet, and revise within the generated block as denoising proceeds.
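The contrast between the two writing styles can be illustrated with a toy. Everything here is hypothetical: `toy_score` and `toy_propose` stand in for a model, and the threshold rule is an assumption for illustration, not DiffusionGemma's actual revision mechanism.

```python
def autoregressive_write(next_token, length):
    """Typewriter style: append one token at a time; earlier tokens never change."""
    out = []
    for _ in range(length):
        out.append(next_token(out))  # sees only the prefix written so far
    return out

def revise_block(block, score, propose, threshold=0.5):
    """Block style: any position may be rewritten when its score, given the whole block, is low."""
    revised = list(block)
    for i in range(len(revised)):
        if score(revised, i) < threshold:
            revised[i] = propose(revised, i)
    return revised

def toy_score(block, i):
    # Looks at a word to the right of position i, which a strictly
    # left-to-right writer could not see when it chose position i.
    looks_wrong = block[i] == "in" and i + 2 < len(block) and block[i + 2] == "mat"
    return 0.1 if looks_wrong else 0.9

def toy_propose(block, i):
    return "on"  # hypothetical replacement token

draft = ["the", "cat", "sat", "in", "the", "mat"]
typed = autoregressive_write(lambda prefix: draft[len(prefix)], len(draft))
revised = revise_block(typed, toy_score, toy_propose)
print(typed)
print(revised)
```

The left-to-right pass keeps "in" because it was chosen before "mat" existed; the block pass can see the later word and replace it.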
The capability matters beyond raw speed because it changes the structure of generation. Sudoku is a task autoregressive models structurally fumble, and a fine-tuned version went from solving roughly 0% of puzzles to 80%. That result frames DiffusionGemma as more than a throughput experiment: it suggests open models may gain new problem-solving behavior when they are allowed to generate, revise, and settle text from noise rather than moving strictly forward token by token.
