What is DiffusionGemma?

DiffusionGemma is Google’s 26-billion-parameter experimental open model that generates text with diffusion instead of producing one token at a time.

How fast is DiffusionGemma?

Nvidia says DiffusionGemma can generate about 1,000 tokens per second on a single H100 GPU. Google also claims more than 700 tokens per second on a GeForce RTX 5090.

Is DiffusionGemma better than standard Gemma models?

Not for general production quality. Google says standard autoregressive Gemma 4 models remain the high-quality production option, while DiffusionGemma is aimed at speed-critical research and developer workflows.

When is DiffusionGemma’s speed advantage strongest?

Its speed advantage is strongest on dedicated accelerators in low-concurrency or single-user workloads. Google says cloud serving with many parallel requests may not benefit in the same way.

DiffusionGemma Smashes 1,000 Tokens a Second, But Lags

Q: How does DiffusionGemma generate text from noise?

It starts with a block of 256 random placeholder tokens and refines them over multiple passes, using bi-directional attention across the block.

Google released the 26-billion-parameter experimental model on June 10, 2026, with open weights and an Apache 2.0 license, according to The Decoder. The headline is not that Google has another Gemma model. It’s that this one generates text through diffusion, refining blocks of tokens in parallel instead of writing one token after another.

The catch matters. Google says DiffusionGemma is faster than comparable autoregressive models in single-user mode on dedicated GPUs, but its output quality is lower than standard Gemma 4. That makes it a research and developer tool for now, not a drop-in replacement for production LLMs.

“While autoregressive Gemma 4 models remain the standard for high-quality production outputs, DiffusionGemma is designed for researchers and developers exploring speed-critical, interactive local workflows such as in-line editing, rapid iteration, and generating non-linear text structures.”

Why 1,000 tokens per second changes the developer math

Most LLM latency problems come from the same basic constraint: the model has to produce the next token, then the next, then the next. That works well enough for uses such as AI chatbot ordering or DoorDash AI search, but it can drag when an app needs repeated model calls, local inference, or interactive edits that should feel instant.

Nvidia says DiffusionGemma reaches 1,000 tokens per second on a single H100. Google also claims more than 700 tokens per second on a GeForce RTX 5090, while Nvidia reports 150 tokens per second on DGX Spark. In local single-user mode, the model runs about four times faster on dedicated GPUs than a comparable autoregressive model.
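To make those rates concrete, here is a rough back-of-the-envelope sketch of how long a response would take at each reported throughput. The 500-token response length is a hypothetical example, and applying the "about four times" figure to every device is an assumption made for illustration; the source only states it for dedicated GPUs in general.

```python
# Reported throughput figures from the article, in tokens per second.
REPORTED_TOKENS_PER_SECOND = {
    "H100": 1000,
    "GeForce RTX 5090": 700,  # reported as "more than 700"; treated as a floor
    "DGX Spark": 150,
}


def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Time to emit num_tokens at a steady rate (ignores prompt processing)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


def implied_baseline_rate(tokens_per_second: float, speedup: float = 4.0) -> float:
    """Rate of a comparable model if the diffusion model is `speedup` times faster."""
    return tokens_per_second / speedup


if __name__ == "__main__":
    response_tokens = 500  # hypothetical response length, not from the article
    for hardware, rate in REPORTED_TOKENS_PER_SECOND.items():
        fast = seconds_to_generate(response_tokens, rate)
        slow = seconds_to_generate(response_tokens, implied_baseline_rate(rate))
        print(f"{hardware}: {fast:.2f}s vs roughly {slow:.2f}s at a 4x slower rate")
```

At the H100 figure, a 500-token reply takes about half a second, while a model running four times slower would need about two seconds. That is roughly the difference between an edit that feels instant and one that makes the user wait.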

That number only applies in the right setting. Google says the speed advantage is strongest on dedicated accelerators running low-concurrency or single-user workloads. In cloud serving with many parallel requests, autoregressive models can already keep hardware busy, so DiffusionGemma can raise costs instead of cutting them.

That caveat is the whole story. DiffusionGemma is not “faster AI” in every deployment. It’s faster when the old decoding pattern leaves GPU compute underused. For teams wrestling with serving efficiency, that distinction echoes the GPU burn problem we covered in Ray Serve vs Triton: Pick Wrong and GPUs Burn Cash.

How does DiffusionGemma generate text from noise?

An autoregressive model generates text sequentially. Each new token depends on the tokens before it. The model cannot see words it has not produced yet, so the output moves left to right.

DiffusionGemma starts differently. It begins with a block of 256 random placeholder tokens and refines them across multiple passes until readable text emerges. The idea comes from diffusion image models, which start from noise and iteratively turn it into a coherent image.
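The refinement loop can be pictured with a toy sketch. This is not Google's sampler. The "denoiser" below is a stand-in that simply reads from a known target sequence, and the placeholder symbol, the number of passes, and the rule for how many positions are committed per pass are all hypothetical. Only the block size of 256 comes from the article.

```python
import random

BLOCK_SIZE = 256         # block length reported in the article
PLACEHOLDER = "<noise>"  # hypothetical placeholder symbol


def toy_denoiser(block, target):
    """Stand-in for the model: proposes a token for every position at once.

    A real model would predict tokens from the prompt and the current block;
    here the "prediction" is just copied from a known target sequence.
    """
    return list(target)


def refine_block(target, num_passes=8, seed=0):
    """Start from placeholders and commit a share of positions on each pass."""
    rng = random.Random(seed)
    block = [PLACEHOLDER] * len(target)
    filled_after_each_pass = []
    for step in range(num_passes):
        proposal = toy_denoiser(block, target)
        open_positions = [i for i, tok in enumerate(block) if tok == PLACEHOLDER]
        remaining_passes = num_passes - step
        # Ceiling division, so the final pass fills whatever is still open.
        quota = -(-len(open_positions) // remaining_passes)
        for i in rng.sample(open_positions, quota):
            block[i] = proposal[i]
        filled_after_each_pass.append(sum(tok != PLACEHOLDER for tok in block))
    return block, filled_after_each_pass


if __name__ == "__main__":
    target = [f"tok{i}" for i in range(BLOCK_SIZE)]
    block, progress = refine_block(target)
    print(progress)
```

The point of the sketch is the shape of the loop: every pass looks at the whole block and positions are filled in no particular order, so the number of passes, not the number of tokens, sets how many model calls are needed.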

That comparison is useful, but text is less forgiving than pixels.

XOOMAR analysis: image diffusion can produce a plausible picture even if some local detail is imperfect. Text has stricter dependencies. Grammar, factual consistency, code syntax, markdown structure, and long-range logic all have to line up. A sentence can collapse because one token is wrong. A code block can fail because a bracket appears in the wrong place.

Google’s answer is bi-directional attention across the block. Because DiffusionGemma works on up to 256 tokens in parallel, each token can attend to tokens before and after it during generation. That is the technical reason Google sees promise in non-linear tasks such as in-line editing, code infilling, amino acid sequences, and mathematical graphs.
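The difference between the two attention patterns is easiest to see as a visibility mask. The sketch below is a generic illustration of causal versus bi-directional masking over one 256-token block, with helper names chosen for this example. It is not taken from DiffusionGemma's code.

```python
BLOCK_SIZE = 256  # block length reported in the article


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position in the block may attend to every other position."""
    return [[True] * n for _ in range(n)]


def visible_count(mask, position):
    """How many positions a given token can see under this mask."""
    return sum(mask[position])


if __name__ == "__main__":
    causal = causal_mask(BLOCK_SIZE)
    full = bidirectional_mask(BLOCK_SIZE)
    print("first token, causal:", visible_count(causal, 0))
    print("first token, bi-directional:", visible_count(full, 0))
```

Under the causal mask, the first token in a block sees only itself. Under the bi-directional mask, it sees all 256 positions, including those that come later, which is the property that makes infilling and editing natural fits.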

A 26B model that only activates 3.8B parameters per step

DiffusionGemma has 26 billion parameters in total, but it activates only 3.8 billion parameters per step. That comes from its Mixture-of-Experts architecture, where specialized sub-networks sit inside the model and only some are used for a given input.
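A minimal sketch of the routing idea follows, assuming a generic top-k router. The article does not say how many experts DiffusionGemma has or how many it selects, so the four scores and k=2 below are purely hypothetical. Only the 26 billion and 3.8 billion figures come from the source.

```python
import math

TOTAL_PARAMS_B = 26.0   # from the article
ACTIVE_PARAMS_B = 3.8   # from the article


def active_fraction(active=ACTIVE_PARAMS_B, total=TOTAL_PARAMS_B):
    """Share of all parameters that take part in a single step."""
    return active / total


def softmax(scores):
    """Numerically stable softmax over a list of router scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Pick the k highest-scoring experts and renormalise their weights."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


if __name__ == "__main__":
    # Hypothetical router scores for four experts; only two are used.
    print(route_top_k([0.1, 2.0, -1.0, 1.5], k=2))
    print(f"active share: {active_fraction():.1%}")
```

By the article's numbers, roughly 15 percent of the parameters are in play on any given step. The active count typically also includes layers that every input passes through, so the share of expert weights actually used may differ.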

That design matters for local hardware. Google says the model can fit inside 18GB of VRAM on high-end consumer GPUs when quantized to lower precision. Nvidia has optimized it for RTX 5090, RTX 4090, Hopper, and Blackwell systems.
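The 18GB figure can be sanity-checked with simple arithmetic. The sketch below counts weight storage only, assumes decimal gigabytes, and ignores activations, caches, and runtime overhead. The source does not say which precision Google used, so the bit widths are illustrative.

```python
TOTAL_PARAMS = 26e9     # total parameter count from the article
VRAM_BUDGET_GB = 18.0   # VRAM figure from the article


def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Storage for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


def max_bits_per_weight(num_params: float, budget_gb: float) -> float:
    """Largest average bit width whose weights still fit in the budget."""
    return budget_gb * 1e9 * 8 / num_params


if __name__ == "__main__":
    for bits in (16, 8, 4):
        size = weight_memory_gb(TOTAL_PARAMS, bits)
        verdict = "fits" if size <= VRAM_BUDGET_GB else "does not fit"
        print(f"{bits}-bit weights: {size:.0f} GB, {verdict} in {VRAM_BUDGET_GB:.0f} GB")
    print(f"ceiling: {max_bits_per_weight(TOTAL_PARAMS, VRAM_BUDGET_GB):.2f} bits per weight")
```

All 26 billion parameters must be resident even though only 3.8 billion are active per step. Under these assumptions, 16-bit and 8-bit weights would overflow an 18GB budget, while something near 4 bits per weight leaves a few gigabytes of headroom.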

Here is the practical split:

Model style	Generation pattern	Hardware behavior in single-user mode	Main trade-off
Autoregressive LLMs	One token at a time	Often memory-bound	Higher quality in standard Gemma 4
DiffusionGemma	Up to 256 tokens refined in parallel	Pushes work toward compute	Faster, but lower output quality

Nvidia describes the old pattern as memory-bound: the GPU waits on memory bandwidth while compute units sit idle. DiffusionGemma changes the workload by processing many tokens at once, giving the GPU more math to do in parallel.
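A simplified roofline-style estimate shows why the pattern matters. Everything in this sketch is an assumption except the 3.8 billion active parameters and the 256-token block: the bandwidth and compute figures are round illustrative numbers rather than vendor specifications, the "2 FLOPs per parameter per token" rule is a common approximation, and attention cost is ignored. It also assumes the same set of weights is read once per pass. In a Mixture-of-Experts model, a block of many tokens may touch more experts than a single token does, which would raise the memory side.

```python
ACTIVE_PARAMS = 3.8e9             # active parameters per step, from the article
BYTES_PER_WEIGHT = 2              # assumption: 16-bit weights
BANDWIDTH_BYTES_PER_S = 3.0e12    # illustrative round number, not a vendor spec
COMPUTE_FLOPS_PER_S = 1.0e15      # illustrative round number, not a vendor spec


def pass_time_seconds(tokens_per_pass,
                      active_params=ACTIVE_PARAMS,
                      bytes_per_weight=BYTES_PER_WEIGHT,
                      bandwidth=BANDWIDTH_BYTES_PER_S,
                      compute=COMPUTE_FLOPS_PER_S):
    """Estimate one forward pass as the slower of weight reads and math."""
    memory_time = active_params * bytes_per_weight / bandwidth
    compute_time = 2 * active_params * tokens_per_pass / compute
    bound = "memory" if memory_time >= compute_time else "compute"
    return max(memory_time, compute_time), bound


if __name__ == "__main__":
    for tokens in (1, 256, 1024):
        seconds, bound = pass_time_seconds(tokens)
        print(f"{tokens:>4} tokens per pass: {seconds * 1e3:.2f} ms, {bound}-bound")
```

With these illustrative numbers, a pass over one token and a pass over 256 tokens take about the same time, because both are limited by reading the weights. That is the idle compute the diffusion approach tries to put to work, although a block still needs several passes before its tokens are final.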

That also explains why Google is not pitching this as a universal replacement. The architecture fits a specific performance gap: local, low-concurrency inference on dedicated GPUs.

Where rougher but faster text can still be useful

Google’s strongest examples are not polished long-form answers. They’re workflows where seeing the whole block at once gives the model an advantage.

Supported use cases include:

In-line editing: Revising or inserting text into an existing passage after the fact.
Code infilling: Filling missing sections of code where later context matters.
Structured data: Working with amino acid sequences or mathematical graphs.
Sudoku-style constraints: Google points to an Unsloth fine-tune where DiffusionGemma solves Sudoku, a task where entries depend on future entries.

The Sudoku example shows why left-to-right generation can be awkward. In a 9x9 Sudoku grid, one choice can depend on cells the model has not generated yet. Google’s example shows the base DiffusionGemma getting 31 cells wrong after 30 denoising steps, while the fine-tuned version solves the puzzle completely.
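For readers who want to see what "cells wrong" means in practice, here is a small sketch of the two checks involved: comparing a draft grid against a known solution, and validating rows, columns, and boxes. The grid comes from a standard shifting pattern and is not the puzzle from Google's example.

```python
def build_solved_grid():
    """Return one valid solved 9x9 grid using a standard shifting pattern."""
    return [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]


def count_wrong_cells(candidate, solution):
    """Number of cells where the draft disagrees with the known solution."""
    return sum(candidate[r][c] != solution[r][c] for r in range(9) for c in range(9))


def is_valid_solution(grid):
    """True when every row, column, and 3x3 box holds the digits 1 to 9 once."""
    digits = set(range(1, 10))
    rows = all(set(row) == digits for row in grid)
    cols = all({grid[r][c] for r in range(9)} == digits for c in range(9))
    boxes = all(
        {grid[br + dr][bc + dc] for dr in range(3) for dc in range(3)} == digits
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    )
    return rows and cols and boxes


if __name__ == "__main__":
    solution = build_solved_grid()
    draft = [row[:] for row in solution]
    # Swap two cells in the first row: the row stays valid, two columns break.
    draft[0][0], draft[0][1] = draft[0][1], draft[0][0]
    print(count_wrong_cells(draft, solution), is_valid_solution(draft))
```

The swap keeps the first row legal but breaks two columns, so a cell's correctness depends on entries far from it in reading order. A left-to-right generator has to commit to each cell before it has produced the rest.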

That doesn’t mean DiffusionGemma is ready for high-stakes user-facing answers. Google says standard Gemma 4 remains the choice when quality matters most. XOOMAR analysis: the safest near-term role is as a fast local draft, edit, or infill engine inside developer workflows where another system or a human can judge the result.

For coding teams, that puts DiffusionGemma closer to workflow infrastructure than chatbot replacement. Its code infilling angle sits near the developer-tool pressure points we discussed in Control Fight Splits Cursor vs Windsurf AI Coding Teams.

The quality gap is the real test

Speed is easy to market. Quality is harder to prove.

Google’s own positioning is careful: DiffusionGemma is experimental, and standard Gemma 4 remains the recommended model for high-quality production output. The Decoder reports that DiffusionGemma runs faster than autoregressive Gemma 4 models but scores lower on accuracy.

Developers can test the model through Hugging Face Transformers, vLLM, and MLX. For fine-tuning, Google points to Hackable Diffusion, Unsloth, and Nvidia NeMo Framework. Support for llama.cpp is planned.

The unresolved question is whether diffusion text models can close the quality gap without giving up the latency advantage. They need better factual reliability, stronger instruction following, more stable long-context behavior, and cleaner control over style.

For now, DiffusionGemma is best read as a serious testbed. If you’re building local interactive tools, code infilling systems, or non-linear text workflows, it’s worth testing. If you need trusted production answers, Google’s own guidance says to stay with standard Gemma 4 and watch whether diffusion-based text generation matures from a fast curiosity into a dependable engine.

The Bottom Line

DiffusionGemma could make local and interactive AI workflows feel much faster for developers.
Its lower quality means it is not yet a replacement for production-grade autoregressive LLMs.
Open weights and an Apache 2.0 license make it easier for researchers to test diffusion-based text generation.

DiffusionGemma vs. standard autoregressive Gemma models

Model/Approach	Generation method	Speed/Performance	Quality/Use case
DiffusionGemma	Refines blocks of tokens in parallel through diffusion	About 1,000 tokens/sec on a single Nvidia H100; about 4x faster than comparable autoregressive models on dedicated GPUs	Lower output quality than Gemma 4; aimed at researchers and speed-critical developer workflows
Gemma 4 / autoregressive models	Generates text one token at a time	Slower in single-user dedicated GPU mode	Standard for high-quality production outputs

XOOMAR Insights Team
