Google’s DeepMind team unveiled an experimental new language model this week that uses techniques originally developed for AI image generators to boost text output performance by as much as 4x on resource-constrained consumer hardware. It is free to download and can run with just 18 GB of DRAM or VRAM. The model, codenamed DiffusionGemma, joins Google’s open-weights model family and is aimed at local deployment rather than cloud-scale inference.
But unlike Gemma 4, which launched this spring, the 26-billion-parameter mixture-of-experts model is not a large language model in the conventional sense. DiffusionGemma is closer to image models such as Stable Diffusion or Flux because it does not generate tokens one after another in an autoregressive sequence. Instead, it generates entire paragraphs of tokens at once, starting with a canvas of random tokens and refining them through successive denoising steps until the final output is reached.
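The refinement loop described above can be illustrated with a toy sketch. This is not DiffusionGemma's algorithm: `toy_denoiser` is a hypothetical stand-in that already knows the answer, where a real model would predict it. The only point is the control flow, in which every position on the canvas is revisited at each step and the number of steps does not grow with the length of the text.

```python
import random


def toy_denoiser(canvas, target, rng, fix_fraction):
    """Hypothetical stand-in for a trained model.

    It inspects every position of the canvas at once and corrects a
    fraction of the wrong tokens. A real model has no access to `target`.
    """
    wrong = [i for i, (c, t) in enumerate(zip(canvas, target)) if c != t]
    k = max(1, round(len(wrong) * fix_fraction)) if wrong else 0
    refined = list(canvas)
    for i in rng.sample(wrong, k):
        refined[i] = target[i]
    return refined


def generate(target, vocab, steps=8, seed=0):
    """Start from random tokens and refine the whole canvas `steps` times."""
    rng = random.Random(seed)
    canvas = [rng.choice(vocab) for _ in target]
    history = [canvas]
    for step in range(steps):
        remaining = steps - step
        # Fixing 1/remaining of the errors each step clears them all by the end.
        canvas = toy_denoiser(canvas, target, rng, fix_fraction=1 / remaining)
        history.append(canvas)
    return history


if __name__ == "__main__":
    target = "the model refines every token in parallel".split()
    vocab = target + ["noise", "static", "blur", "fog"]
    for step, canvas in enumerate(generate(target, vocab)):
        print(step, " ".join(canvas))
```

An autoregressive model would need one pass per token for the same sentence; here the pass count is fixed by `steps`, an illustrative parameter chosen for the sketch.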
Google is positioning the approach as a way to make better use of consumer hardware. Conventional large language models are memory-bandwidth bound because the active parameters need to be streamed from memory for every token generated, making VRAM capacity and bandwidth the major constraints. Diffusion models are described as more compute-bound, which can let high-end graphics cards put otherwise idle processing capacity toward faster output during local inference.
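The bandwidth argument can be made concrete with a back-of-envelope calculation. The sketch below is a simplification with illustrative numbers, none of them from Google: it ignores KV-cache traffic, compute limits, and the cost of extra denoising passes, and assumes the only bottleneck is reading the active weights once per forward pass.

```python
def bandwidth_bound_tokens_per_s(active_params, bytes_per_param,
                                 bandwidth_bytes_per_s, tokens_per_pass=1):
    """Upper bound on tokens/s if weight reads are the only bottleneck.

    Each forward pass streams the active weights from memory once, so the
    pass rate is bandwidth divided by bytes read per pass. `tokens_per_pass`
    is the average number of tokens finalized per pass (1 for plain
    autoregressive decoding).
    """
    bytes_per_pass = active_params * bytes_per_param
    passes_per_s = bandwidth_bytes_per_s / bytes_per_pass
    return passes_per_s * tokens_per_pass


# Hypothetical figures: 4 billion active parameters, 8-bit weights,
# and 1 TB/s of memory bandwidth.
one_at_a_time = bandwidth_bound_tokens_per_s(4e9, 1, 1e12)
# Hypothetical parallel decoder that finalizes 8 tokens per pass on average.
in_parallel = bandwidth_bound_tokens_per_s(4e9, 1, 1e12, tokens_per_pass=8)
print(one_at_a_time, in_parallel)
```

Under these assumptions the ceiling scales with the tokens produced per weight read, which is why a model that works on many tokens per pass shifts the limit from memory bandwidth toward compute.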
DiffusionGemma also reflects the tradeoffs seen in earlier diffusion language models. According to Google, the 26-billion-parameter model falls just behind Gemma 4 12B on the GPQA-Diamond benchmark, with its main advantage being output speed. Google's chart shows a roughly 2.25x speedup for DiffusionGemma over the 12-billion-parameter large language model with speculative decoding enabled. Compared to Gemma 4 26B-A4B, the speedup is nearly 4x when running on a single Nvidia H100.
Google is releasing DiffusionGemma as an experimental model rather than an enterprise-focused one. The model is available on repositories including Hugging Face under the highly permissive Apache 2.0 license, with support already merged into vLLM, MLX, and HF Transformers, and llama.cpp support coming soon. Google has also been leaning more heavily on local inference, including a small language model shipped with Chrome in May, as a way to reduce the cloud costs tied to AI services.
