NVIDIA speeds up Google DeepMind's DiffusionGemma for local AI

Google DeepMind’s DiffusionGemma uses diffusion-style parallel text generation instead of token-by-token output. NVIDIA says its optimizations make the open model faster across local RTX, RTX PRO and DGX systems.

Google DeepMind has released DiffusionGemma, an experimental open model designed for very fast text generation, and NVIDIA has optimized it for GeForce RTX GPUs, the NVIDIA RTX PRO platform and NVIDIA DGX Spark systems. Instead of generating text one token at a time, DiffusionGemma generates multiple tokens in parallel to produce whole blocks of text, targeting low-latency, single-user workloads for developers, researchers and AI enthusiasts.

DiffusionGemma denoises up to 256 tokens per step instead of predicting one at a time. It pairs a diffusion head with Google’s Gemma 4 architecture, a 26-billion-parameter mixture-of-experts model that activates just 3.8 billion parameters per step. The weights are open under the permissive Apache 2.0 license, and the model can run entirely on RTX GPUs and DGX Spark, with support in Hugging Face Transformers, vLLM and Unsloth.
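For a rough sense of what those parameter counts mean on local hardware, the sketch below works out the active-parameter fraction and the memory needed just to hold the weights. The bytes-per-parameter values are assumptions for illustration (the article does not say which precisions are available), and the estimate ignores activations, caches and runtime overhead.

```python
# Back-of-envelope arithmetic from the parameter counts quoted above.
TOTAL_PARAMS = 26e9    # total parameters (from the article)
ACTIVE_PARAMS = 3.8e9  # parameters activated per step (from the article)

# Assumed storage precisions; the article does not state these.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def active_fraction(total: float = TOTAL_PARAMS, active: float = ACTIVE_PARAMS) -> float:
    """Share of the weights used on any single step."""
    return active / total


def weight_memory_gb(precision: str, total: float = TOTAL_PARAMS) -> float:
    """Gigabytes (10^9 bytes) needed to store every weight at the given precision."""
    return total * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    print(f"active fraction: {active_fraction():.1%}")
    for name in BYTES_PER_PARAM:
        print(f"{name}: {weight_memory_gb(name):.0f} GB of weights")
```

In a mixture-of-experts model, every expert generally has to be resident in memory even though only a fraction is used on each step, so the total count drives the memory footprint while the active count drives the compute per step.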

Most large language models in wide use are autoregressive: they generate text one token at a time, with each new token depending on the ones before it. DiffusionGemma takes a different approach, starting from noise and refining a whole block of text at once in a process closer to how diffusion models generate images. That block-based design is intended to benefit interactive chat, agentic loops and on-device assistants that need faster response times.
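The difference between the two approaches can be illustrated by counting sequential model passes. The sketch below is a simplified illustration, not DiffusionGemma's actual decoding schedule: the block size of 256 comes from the article, while `steps_per_block` is a hypothetical value chosen only for the example.

```python
import math


def autoregressive_passes(num_tokens: int) -> int:
    """An autoregressive decoder needs one sequential pass per generated token."""
    return max(num_tokens, 0)


def block_diffusion_passes(
    num_tokens: int,
    block_size: int = 256,      # tokens refined together (from the article)
    steps_per_block: int = 16,  # hypothetical number of denoising passes per block
) -> int:
    """A block-diffusion decoder refines each block for a fixed number of passes."""
    if num_tokens <= 0:
        return 0
    blocks = math.ceil(num_tokens / block_size)
    return blocks * steps_per_block


if __name__ == "__main__":
    for n in (256, 1024):
        print(n, autoregressive_passes(n), block_diffusion_passes(n))
```

Pass counts are not a direct measure of speedup, because each denoising pass processes a whole block and costs more than a single-token pass. The point is only that the number of sequential steps no longer grows one-for-one with the number of tokens.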

DiffusionGemma delivers 1,000 tokens/sec on a single NVIDIA H100 Tensor Core GPU, 150 tokens/sec on NVIDIA DGX Spark and up to 2,000 tokens/sec on NVIDIA DGX Station, roughly 4x faster than an equivalent autoregressive model running in the same single-user regime. The model runs locally on DGX Spark, the deskside personal AI supercomputer powered by the NVIDIA GB10 Grace Blackwell Superchip with 128GB of unified memory, as well as on NVIDIA RTX PRO 6000 workstations, DGX Station with 748GB of coherent memory, and GeForce RTX GPUs, with llama.cpp support coming soon.
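To put those throughput figures in terms of response time, the sketch below converts them into seconds per response and into the autoregressive baseline implied by the "roughly 4x" claim. Applying the 4x factor uniformly to every system is an assumption made for illustration, and the response length used in the example is arbitrary.

```python
# Throughput figures quoted in the article (tokens per second, single user).
THROUGHPUT_TOK_PER_SEC = {
    "H100": 1000.0,
    "DGX Spark": 150.0,
    "DGX Station": 2000.0,  # quoted as "up to"
}

# "Roughly 4x" from the article; applying it to each system is an assumption.
SPEEDUP_VS_AUTOREGRESSIVE = 4.0


def seconds_to_generate(num_tokens: int, tokens_per_sec: float) -> float:
    """Time to emit a response of num_tokens at a steady throughput."""
    return num_tokens / tokens_per_sec


def implied_baseline_tok_per_sec(
    tokens_per_sec: float, speedup: float = SPEEDUP_VS_AUTOREGRESSIVE
) -> float:
    """Autoregressive throughput implied by the quoted speedup."""
    return tokens_per_sec / speedup


if __name__ == "__main__":
    response_tokens = 600  # arbitrary example length
    for system, rate in THROUGHPUT_TOK_PER_SEC.items():
        fast = seconds_to_generate(response_tokens, rate)
        slow = seconds_to_generate(response_tokens, implied_baseline_tok_per_sec(rate))
        print(f"{system}: {fast:.2f} s vs about {slow:.2f} s for the implied baseline")
```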

Developers can test and prototype DiffusionGemma through Hugging Face Transformers, which runs the model on a GeForce RTX 5090 or DGX Spark out of the box. Higher-throughput inference is supported through vLLM, while fine-tuning is available through Unsloth and the NVIDIA NeMo framework, with DGX Spark playbooks for setting up local environments.


