DFlash accelerates large language model inference with block diffusion

DFlash uses block-diffusion speculative decoding to reduce large language model inference latency while keeping the target model as the verifier. The workflow covers draft-model training, FlashAttention integration, and deployment through Regolo Custom Models.

DFlash is a speculative decoding technique in which a lightweight block diffusion draft model proposes future tokens in parallel while the larger target model verifies them, keeping generation lossless. Standard autoregressive decoding is sequential because each new token depends on the tokens generated before it, which makes decode latency and GPU memory bandwidth major production bottlenecks. Unlike traditional autoregressive drafters, the DFlash drafter uses a single forward pass with bidirectional attention to generate 8 to 16 tokens simultaneously, achieving speedups of up to 3x.
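The draft-then-verify loop described above can be sketched as follows. This is a minimal illustration of greedy speculative decoding under stated assumptions, not the DFlash implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for the target model and the block drafter, and verification is shown for greedy decoding only.

```python
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]        # greedy next token of the target
DraftBlockFn = Callable[[List[Token], int], List[Token]]  # proposes a whole block


def verify_block(target_next: NextTokenFn, prefix: List[Token],
                 draft: List[Token]) -> List[Token]:
    """Return the tokens committed this step.

    A real system scores every draft position in one target forward pass;
    this loop calls the target per position only to keep the sketch simple.
    """
    committed: List[Token] = []
    context = list(prefix)
    for proposed in draft:
        expected = target_next(context)
        if proposed != expected:
            committed.append(expected)  # replace the first wrong token and stop
            return committed
        committed.append(proposed)
        context.append(proposed)
    committed.append(target_next(context))  # whole block accepted: one extra token
    return committed


def speculative_generate(target_next: NextTokenFn, draft_block: DraftBlockFn,
                         prompt: List[Token], max_new_tokens: int,
                         block_size: int) -> List[Token]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        draft = draft_block(out, block_size)
        out.extend(verify_block(target_next, out, draft))
    return out[:len(prompt) + max_new_tokens]


def greedy_generate(target_next: NextTokenFn, prompt: List[Token],
                    max_new_tokens: int) -> List[Token]:
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


# Hypothetical toy models: the target counts upward modulo 10, and the
# drafter agrees with it except at the third position of every block.
def toy_target(context: List[Token]) -> Token:
    return (context[-1] + 1) % 10


def toy_draft(context: List[Token], block_size: int) -> List[Token]:
    block = [(context[-1] + 1 + i) % 10 for i in range(block_size)]
    if block_size > 2:
        block[2] = (block[2] + 5) % 10  # deliberate drafting error
    return block
```

Because every committed token is either confirmed or supplied by the target, the speculative output matches plain greedy decoding with the target alone; the drafter only changes how many tokens are committed per verification step.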

The key architectural shift is in the draft phase. Instead of asking a drafter to predict token after token, DFlash fills a masked block at once using bidirectional attention inside the block. The draft model is also conditioned on hidden features extracted from the target model; these features are fused and injected into the Key/Value projections of every draft layer. Baseten explains the trade-off clearly: a single DFlash draft pass can be slower than a single EAGLE draft pass, but DFlash predicts 8 to 16 tokens at once while EAGLE predicts one token per draft pass.
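To make "bidirectional attention inside the block" concrete, the sketch below builds a boolean attention mask and compares it with an ordinary causal mask. The layout is an assumption made for illustration: context positions come first and attend causally, and the block positions follow. The source does not specify how DFlash arranges its context, and the function names are hypothetical.

```python
from typing import List

Mask = List[List[bool]]  # mask[q][k] is True when query q may attend to key k


def causal_mask(n: int) -> Mask:
    """Standard autoregressive mask: each position sees itself and earlier ones."""
    return [[k <= q for k in range(n)] for q in range(n)]


def block_bidirectional_mask(context_len: int, block_size: int) -> Mask:
    """Causal over the context, fully bidirectional inside the trailing block.

    Block positions see the whole context and every other block position,
    including later ones, so the block can be filled in a single pass.
    """
    n = context_len + block_size
    mask: Mask = []
    for q in range(n):
        row = []
        for k in range(n):
            both_in_block = q >= context_len and k >= context_len
            row.append(k <= q or both_in_block)
        mask.append(row)
    return mask
```

With a context of 2 and a block of 3, the first block row of the causal mask can see 3 keys, while the same row of the block mask can see all 5. That extra visibility is what allows the masked positions to be predicted jointly rather than one at a time.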

Z Lab maintains public DFlash draft models for Qwen, Llama, Gemma, Kimi, and GPT-OSS families on Hugging Face. These components are not standalone chat models and must be paired with their intended target models. A mismatched drafter can reduce acceptance rates and remove the performance advantage, so teams generally need a compatible drafter for each target model or model family.

The implementation path centers on installing the Transformers and FlashAttention dependencies, patching Qwen attention layers with a Dynamic FlashAttention module, testing CUDA inference, and publishing the resulting model to Hugging Face before deploying it through Regolo Custom Models. Full variable-length FlashAttention support requires cumulative sequence lengths, maximum sequence lengths, the varlen function, and unpad/pad preprocessing. The main trade-off is added system complexity across target models, draft models, hidden-state extraction, and runtime speculative decoding, so workload-specific latency and acceptance-rate testing is essential before production use.
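The variable-length inputs listed above can be illustrated without a GPU. The sketch below uses plain Python lists and hypothetical helper names (`unpad`, `pad`) to show what cumulative sequence lengths and the maximum sequence length are, and how unpad and pad preprocessing round-trips a batch. In the flash-attn package the corresponding tensors are passed to `flash_attn_varlen_func` as `cu_seqlens_q`, `cu_seqlens_k`, `max_seqlen_q`, and `max_seqlen_k`; that call is not made here.

```python
from itertools import accumulate
from typing import List, Tuple


def unpad(batch: List[List[int]],
          lengths: List[int]) -> Tuple[List[int], List[int], int]:
    """Pack a padded batch into one flat sequence.

    Returns the packed tokens, the cumulative sequence lengths (one more
    entry than there are sequences, starting at 0), and the longest length.
    """
    packed = [tok for row, n in zip(batch, lengths) for tok in row[:n]]
    cu_seqlens = [0] + list(accumulate(lengths))
    max_seqlen = max(lengths)
    return packed, cu_seqlens, max_seqlen


def pad(packed: List[int], cu_seqlens: List[int], max_seqlen: int,
        pad_value: int = 0) -> List[List[int]]:
    """Inverse of unpad: slice each sequence out and right-pad it."""
    rows = []
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        seq = packed[start:end]
        rows.append(seq + [pad_value] * (max_seqlen - len(seq)))
    return rows


# Example batch of three sequences with real lengths 3, 2, and 4 (0 is padding).
batch = [[5, 6, 7, 0], [8, 9, 0, 0], [1, 2, 3, 4]]
lengths = [3, 2, 4]
packed, cu_seqlens, max_seqlen = unpad(batch, lengths)
restored = pad(packed, cu_seqlens, max_seqlen)
```

Sequence `i` occupies `packed[cu_seqlens[i]:cu_seqlens[i + 1]]`, so the kernel can skip padding tokens entirely instead of masking them.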


