Text Diffusion Model | June 2026 guide
Non-autoregressive text generation, explained

DiffusionGemma is Google's experimental text diffusion model built on Gemma 4.

DiffusionGemma generates text by refining a block of tokens in parallel instead of writing one token at a time. This guide explains the architecture, the serving path, and the fastest way to start experimenting.

Creator: Google DeepMind
Launch: June 2026
Base: Gemma 4 MoE
[Diagram labels: 256-token noisy canvas; parallel token refinement; denoising pass; Gemma 4 KV cache → committed text block; then prefill the next canvas]
Primary source
The official developer guide explains the Gemma 4 MoE backbone, 256-token canvases, and vLLM serving path.
Core mechanic
The model starts with placeholder tokens and iteratively denoises the full canvas with bidirectional context.
Overview

What makes DiffusionGemma different from autoregressive language models

Built on Gemma 4 MoE

DiffusionGemma uses a 26B Mixture-of-Experts Gemma 4 backbone while activating roughly 4B parameters during inference.
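
To make the 26B versus 4B distinction concrete, here is a back-of-envelope sketch of weight memory. The parameter counts come from the text above; the bytes-per-parameter values and the helper name `weight_gib` are illustrative assumptions, and the estimate ignores the KV cache and activations.

```python
# Rough weight-memory estimate for a Mixture-of-Experts model.
# Parameter counts are taken from the guide; precisions are assumptions.
TOTAL_PARAMS = 26e9   # every expert has to be resident in memory
ACTIVE_PARAMS = 4e9   # roughly what is activated during inference

def weight_gib(params: float, bytes_per_param: float) -> float:
    """Return the weight footprint in GiB (weights only, no KV cache)."""
    return params * bytes_per_param / 2**30

for name, bytes_per_param in [("bf16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    resident = weight_gib(TOTAL_PARAMS, bytes_per_param)
    active = weight_gib(ACTIVE_PARAMS, bytes_per_param)
    print(f"{name}: resident {resident:.1f} GiB, active {active:.1f} GiB")
```

The point of the comparison: memory capacity is sized by the full 26B, while the work done per forward pass tracks the roughly 4B active parameters.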

Open weights for developers

The model weights are available on Hugging Face, giving researchers and builders a direct path to local experiments.

Parallel block generation

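Autoregressive decoding reads the model weights once for every generated token. DiffusionGemma instead refines a 256-token canvas in parallel, so each pass over the weights serves many positions and the workload shifts from memory bandwidth toward compute.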
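
A rough way to see the tradeoff is to count forward passes. The sketch below assumes one pass per generated token for autoregressive decoding and one pass per denoising step for each canvas. The 256-token canvas size comes from the guide; `steps_per_canvas=32` is a made-up value for illustration, not a documented default.

```python
import math

def autoregressive_passes(num_tokens: int) -> int:
    """One forward pass (one sweep over the active weights) per token."""
    return num_tokens

def block_diffusion_passes(num_tokens: int, canvas_size: int = 256,
                           steps_per_canvas: int = 32) -> int:
    """One forward pass per denoising step, repeated for each canvas.

    steps_per_canvas is a hypothetical value, not taken from the guide.
    """
    canvases = math.ceil(num_tokens / canvas_size)
    return canvases * steps_per_canvas

for n in (256, 1024, 4096):
    print(n, autoregressive_passes(n), block_diffusion_passes(n))
```

Each diffusion pass processes a whole canvas, so it costs more compute than a single-token pass. The saving is in how often the weights have to be read, which is why the number of denoising steps is the main knob.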

This page is built for fast understanding

If you want the definition, the architecture, and the shortest path to serving DiffusionGemma, the key details are all here.

How It Works

The diffusion process denoises blocks of text instead of decoding strictly left to right

DiffusionGemma combines causal prefill for committed context with bidirectional denoising over the current token canvas.

01

Prefill the prompt context

The model ingests the prompt with causal attention and writes the prompt context into the KV cache before denoising starts.

02

Initialize a token canvas

The sampler starts with a 256-token canvas of placeholders, then updates all positions in parallel over multiple denoising steps.

03

Bidirectional denoising

During denoising, each canvas position can attend to the other positions, which enables global context propagation and self-correction.
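
The combination of causal prefill and bidirectional denoising can be pictured as an attention mask. The sketch below is an assumption about the mask shape based on the description above, not code from the model: prompt positions see only earlier prompt positions, while canvas positions see the whole prompt and every canvas position. The function name `build_mask` is hypothetical.

```python
def build_mask(prompt_len: int, canvas_len: int) -> list[list[bool]]:
    """Return mask[q][k] == True when query position q may attend to key k."""
    total = prompt_len + canvas_len
    mask = []
    for q in range(total):
        if q < prompt_len:
            # Prompt query: causal, sees itself and earlier prompt tokens only.
            row = [k <= q for k in range(total)]
        else:
            # Canvas query: sees the full prompt and the full canvas.
            row = [True] * total
        mask.append(row)
    return mask

# Tiny example: 3 prompt tokens followed by a 4-token canvas.
for row in build_mask(prompt_len=3, canvas_len=4):
    print("".join("x" if allowed else "." for allowed in row))
```

The top-left corner of the printed grid is the familiar causal triangle; the bottom rows are fully filled, which is what lets a late canvas position influence an early one.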

04

Commit and continue

When a block stabilizes, it is committed to the KV cache and the next 256-token canvas is initialized for longer generations.
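
The four steps above can be sketched as a toy loop. This is not the real sampler: `toy_denoiser` cheats by reading a known target, the `<mask>` placeholder is a hypothetical token, and the confidence-ordered fill schedule is a common choice in masked diffusion samplers that is assumed here rather than confirmed by the guide. Only the control flow (initialize, refine in parallel, commit, move to the next canvas) is the point.

```python
import math
import random

MASK = "<mask>"  # hypothetical placeholder token

def toy_denoiser(canvas, target, rng):
    """Stand-in for the model: a (token, confidence) proposal per position.

    It reads the target text directly, so it only illustrates control flow.
    """
    return [(target[i], rng.random()) for i in range(len(canvas))]

def denoise_block(target, steps, rng):
    """Refine one canvas from all placeholders to a fully filled block."""
    canvas = [MASK] * len(target)
    for step in range(steps):
        masked = [i for i, tok in enumerate(canvas) if tok == MASK]
        if not masked:
            break  # nothing left to refine
        proposals = toy_denoiser(canvas, target, rng)
        # Fill an even share of the remaining placeholders, most confident first.
        quota = math.ceil(len(masked) / (steps - step))
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:quota]:
            canvas[i] = proposals[i][0]
    return canvas

def generate(target_tokens, canvas_size=4, steps=3, seed=0):
    """Denoise one canvas at a time and commit each finished block in order."""
    rng = random.Random(seed)
    committed = []  # plays the role of committed context in the KV cache
    for start in range(0, len(target_tokens), canvas_size):
        block = target_tokens[start:start + canvas_size]
        committed.extend(denoise_block(block, steps, rng))
    return committed

tokens = "block diffusion refines many tokens at once".split()
print(generate(tokens))
```

Unlike the behavior described above, this toy never revises a position once it is filled. A sampler that revisits uncertain positions would also allow filled, low-confidence tokens to be reset to the placeholder between steps.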

Why It Matters

It brings diffusion-style decoding to open text generation.

DiffusionGemma changes the generation loop: it refines whole token blocks, uses bidirectional context inside each block, and can revisit uncertain positions before committing output.

Gemma 4 backbone
The model uses the same architectural family as Gemma 4: a 26B-parameter MoE with roughly 4B active parameters.
Apache 2.0
Google released DiffusionGemma as an experimental open model under Apache 2.0.
Serving-ready path
The developer guide points to vLLM and Hugging Face paths so teams can test the model through familiar APIs.
Deep Dive

DiffusionGemma: the official developer guide

Google's guide explains the block-autoregressive denoising loop, the 256-token canvas, and the deployment path for developers.

Quick Start

How to go from reading about DiffusionGemma to serving text

The simplest path is to run the Hugging Face model with vLLM, then call the OpenAI-compatible chat endpoint from your own scripts or tools.

01

Install dependencies

Install vLLM in a Python environment with a supported GPU stack.

pip install vllm
02

Serve the model

Start a local OpenAI-compatible endpoint using the official Hugging Face model ID.

vllm serve "google/diffusiongemma-26B-A4B-it"
03

Call the chat API

Send a normal chat-completions request while the server handles block denoising internally.

curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "google/diffusiongemma-26B-A4B-it",
    "messages": [{"role":"user","content":"Explain block diffusion in one paragraph."}]
  }'
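
The same request can be sent from a script. The sketch below uses only the Python standard library and assumes the server from step 02 is listening on localhost:8000. The helper names `build_request` and `chat` are illustrative, and the `max_tokens` value is an arbitrary choice. The response parsing follows the standard OpenAI chat-completions shape.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL_ID = "google/diffusiongemma-26B-A4B-it"

def build_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build the POST request without sending it."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # arbitrary illustrative limit
    }
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt: str, timeout_s: float = 120.0) -> str:
    """Send the request to the local server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout_s) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Explain block diffusion in one paragraph."))` mirrors the curl call. Building the request separately from sending it makes the payload easy to inspect without a GPU.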
04

Explore serving tradeoffs

Tune denoising steps, max tokens, batching, and hardware choice for latency-sensitive text workflows.

# Watch throughput and latency while changing:
# - max output length
# - denoising steps
# - batch size
# - GPU target
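
One way to watch those numbers is a small timing harness. The sketch below is a minimal, sequential measurement loop: `send` is any function that takes a prompt and returns the reply text, for example a thin wrapper around the chat endpoint. The name `benchmark` and the characters-per-second metric are illustrative choices, and because requests are sent one at a time, this does not exercise server-side batching.

```python
import statistics
import time
from typing import Callable

def benchmark(send: Callable[[str], str], prompts: list[str]) -> dict:
    """Time each request and report simple latency and throughput figures."""
    latencies = []
    total_chars = 0
    wall_start = time.perf_counter()
    for prompt in prompts:
        t0 = time.perf_counter()
        reply = send(prompt)
        latencies.append(time.perf_counter() - t0)
        total_chars += len(reply)
    wall = time.perf_counter() - wall_start
    return {
        "requests": len(prompts),
        "mean_latency_s": statistics.mean(latencies),
        "max_latency_s": max(latencies),
        "chars_per_s": total_chars / wall if wall > 0 else 0.0,
    }

# Dry run with a fake sender, so the harness can be checked without a server.
print(benchmark(lambda p: p.upper(), ["short prompt", "a slightly longer prompt"]))
```

Rerun the same prompt set after each change (output length, denoising steps, GPU) and compare the dictionaries; changing one setting at a time keeps the comparison readable.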
FAQ

The fastest answers to the questions people ask first

Start here if you want the creator, the architecture, or the serving path without reading the full guide first.

Who created DiffusionGemma?
DiffusionGemma comes from Google DeepMind. The developer guide presents it as an experimental text diffusion model built on Gemma 4 for non-autoregressive language generation.
Is DiffusionGemma a text-to-image model?
No. DiffusionGemma is a text generation model. It borrows the diffusion idea of iterative denoising, but the canvas is a block of text tokens rather than pixels or latent images.
How does generation differ from a normal LLM?
Autoregressive LLMs predict one next token at a time. DiffusionGemma initializes a block of tokens, denoises many positions in parallel, then commits that block before continuing.
Where can developers run it from?
The official guide points developers to the Hugging Face model and vLLM serving path. Review the model card and Apache 2.0 license terms before production use.