DiffusionGemma rethinks text generation with diffusion

DiffusionGemma applies diffusion-style denoising to text, trading autoregressive token-by-token decoding for iterative canvas refinement. Its design combines encoder guidance, bidirectional denoising, scheduling, and entropy-based sampling.

DiffusionGemma shifts text generation away from the standard autoregressive pattern used by Large Language Models. Autoregressive models generate one token at a time and can efficiently serve many users because decoding is often memory bound rather than compute bound. DiffusionGemma instead starts with a sequence of 256 randomly initialized tokens, called a canvas, and tries to choose better tokens for the entire canvas at once. Predicting 256 tokens in parallel focuses the compute budget of 256 tokens on a single user instead of spreading it across many users.

The approach adapts diffusion, a process associated with image generation, to discrete text. Noise cannot be added to a token in the same continuous way as to pixels, so DiffusionGemma uses uniform state diffusion rather than only masked diffusion. In forward diffusion, tokens are replaced with random tokens as noise to build the training data, much as masked diffusion replaces tokens with a mask. In reverse diffusion, the model detects which tokens are noise, proposes replacements across the canvas, accepts confident positions, and re-noises low-probability positions so the canvas stays close to the distribution seen during training.
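The difference between the two forward processes can be sketched in a few lines of Python. This is an illustration only, not DiffusionGemma's training code: the function names, the `noise_level` parameter, and the `mask_id` value are assumptions made for the example.

```python
import random


def uniform_noise(tokens, noise_level, vocab_size, rng):
    """Uniform state corruption: with probability noise_level, replace a
    token with an id drawn uniformly from the vocabulary."""
    noisy, is_noise = [], []
    for tok in tokens:
        corrupt = rng.random() < noise_level
        noisy.append(rng.randrange(vocab_size) if corrupt else tok)
        is_noise.append(corrupt)
    return noisy, is_noise


def mask_noise(tokens, noise_level, mask_id, rng):
    """Masked corruption: with probability noise_level, replace a token
    with a single dedicated mask id."""
    noisy, is_noise = [], []
    for tok in tokens:
        corrupt = rng.random() < noise_level
        noisy.append(mask_id if corrupt else tok)
        is_noise.append(corrupt)
    return noisy, is_noise


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [11, 42, 7, 99, 3, 58]
    print(uniform_noise(clean, 0.5, vocab_size=1000, rng=rng))
    print(mask_noise(clean, 0.5, mask_id=-1, rng=rng))
```

With masking, the corrupted positions are visible to the model because they all carry the mask id. With uniform noise, a corrupted position looks like any other token, which is why the reverse process first has to detect which tokens are noise.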

Rather than training from scratch, DiffusionGemma starts from an existing checkpoint, the Gemma 4 26B A4B model, a Mixture of Experts model that has already been through extensive training and performs well. The architecture uses an encoder-denoiser patch that lets a decoder-only model switch between encoder mode, which processes the input query, and denoiser mode, which updates the canvas. In denoiser mode, causal attention is replaced with bidirectional attention so each token can attend to all other tokens in the sequence. The model also shares the encoder’s KV cache with the denoiser, allowing the denoising steps to reuse prompt context without recalculating it.
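The change from causal to bidirectional attention comes down to which positions each token is allowed to see. The sketch below builds the two masks as plain boolean matrices, where `True` at row i, column j means position i may attend to position j. The `denoiser_mask` layout, with cached prompt positions followed by canvas positions, is an assumption made for illustration and not a description of the actual implementation.

```python
def causal_mask(n):
    """Position i may attend only to positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def denoiser_mask(prompt_len, canvas_len):
    """Assumed layout: one row per canvas position, with columns covering
    the cached prompt positions followed by the canvas positions. Each
    canvas token sees the whole prompt and the whole canvas."""
    return [[True] * (prompt_len + canvas_len) for _ in range(canvas_len)]


if __name__ == "__main__":
    for row in causal_mask(3):
        print(row)
    # [True, False, False]
    # [True, True, False]
    # [True, True, True]
```

Because the prompt columns are served from the shared KV cache, each denoising step only has to compute queries, keys, and values for the canvas rows.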

Inference combines iterative diffusion with autoregressive stitching. The canvas holds only 256 tokens, so longer outputs are generated one canvas at a time. For each canvas, the encoder runs once over the context to produce the KV cache, after which the denoiser takes a number of steps to fill the canvas. When the canvas is finished, its 256 tokens are appended to the prompt and passed through the encoder to extend the KV cache before the next canvas begins. Scheduling controls the maximum denoising steps, logits temperature, and adaptive stopping; in the configuration of DiffusionGemma, the confidence threshold is 0.005 and the stability threshold is 1. The default Entropy Bounded Sampler initializes the canvas with uniformly drawn random tokens, accepts tokens where entropy shows sufficient confidence, and re-noises rejected tokens for later refinement.
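A minimal sketch of these two ideas follows, under stated assumptions. The `entropy_bound` parameter, the greedy choice of the most likely token, and the `fill_canvas` callback are hypothetical simplifications; the actual Entropy Bounded Sampler and the way the configured confidence and stability thresholds are applied may differ.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (in nats) of one probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sampler_step(probs_per_position, entropy_bound, rng):
    """One refinement step over the canvas.

    Positions whose predicted distribution has entropy at or below the
    (hypothetical) bound keep their most likely token. All other positions
    are re-noised with a uniformly drawn token for later refinement."""
    vocab_size = len(probs_per_position[0])
    canvas, accepted = [], []
    for probs in probs_per_position:
        if entropy(probs) <= entropy_bound:
            canvas.append(max(range(vocab_size), key=lambda t: probs[t]))
            accepted.append(True)
        else:
            canvas.append(rng.randrange(vocab_size))
            accepted.append(False)
    return canvas, accepted


def generate(prompt, num_canvases, fill_canvas):
    """Autoregressive stitching: fill one canvas at a time and append each
    finished canvas to the context used for the next one. fill_canvas is a
    hypothetical callback that stands in for the full denoising loop."""
    context = list(prompt)
    for _ in range(num_canvases):
        context.extend(fill_canvas(context))
    return context


if __name__ == "__main__":
    rng = random.Random(0)
    predictions = [
        [0.97, 0.01, 0.01, 0.01],  # confident: low entropy
        [0.25, 0.25, 0.25, 0.25],  # uncertain: maximum entropy
    ]
    print(sampler_step(predictions, entropy_bound=0.5, rng=rng))
```

In the toy predictions, the first position is accepted because almost all probability sits on one token, while the second is re-noised because the distribution is flat.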


