XOOMAR
Technology · June 11, 2026 · 7 min read · By XOOMAR Insights Team

1,000 Tokens a Second: DiffusionGemma Breaks LLM Math

Updated on June 11, 2026

DiffusionGemma can generate about 1,000 tokens per second on a single Nvidia H100, a speed claim that puts Google’s new open model in a different lane from the token-by-token LLMs developers usually deploy.

Google released the 26-billion-parameter experimental model on June 10, 2026, with open weights and an Apache 2.0 license, according to The Decoder. The headline is not that Google has another Gemma model. It’s that this one generates text through diffusion, refining blocks of tokens in parallel instead of writing one token after another.

The catch matters. Google says DiffusionGemma is faster than comparable autoregressive models in single-user mode on dedicated GPUs, but its output quality is lower than standard Gemma 4. That makes it a research and developer tool for now, not a drop-in replacement for production LLMs.

“While autoregressive Gemma 4 models remain the standard for high-quality production outputs, DiffusionGemma is designed for researchers and developers exploring speed-critical, interactive local workflows such as in-line editing, rapid iteration, and generating non-linear text structures.”

Why 1,000 tokens per second changes the developer math

Most LLM latency problems come from the same basic constraint: the model has to produce the next token, then the next, then the next. That works well enough for chat, but it can drag when an app needs repeated model calls, local inference, or interactive edits that should feel instant.

Nvidia says DiffusionGemma reaches 1,000 tokens per second on a single H100 and 150 tokens per second on DGX Spark, while Google claims more than 700 tokens per second on a GeForce RTX 5090. In local single-user mode, the model runs about four times faster on dedicated GPUs than a comparable autoregressive model.
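To make those throughput figures concrete, the sketch below converts them into wait time for a fixed-length response. The 500-token response length and the function name are illustrative assumptions; the throughput values are the ones reported above, and the calculation ignores prompt processing and warm-up.

```python
# Reported single-user throughput (tokens per second), as quoted in the article.
# The RTX 5090 figure is stated as "more than 700", so 700 is a floor.
REPORTED_TOKENS_PER_SEC = {
    "Nvidia H100": 1000,
    "GeForce RTX 5090": 700,
    "DGX Spark": 150,
}


def seconds_to_generate(num_tokens: int, tokens_per_sec: float) -> float:
    """Idealized generation time; ignores prompt processing and warm-up."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return num_tokens / tokens_per_sec


if __name__ == "__main__":
    response_tokens = 500  # hypothetical response length, chosen for illustration
    for hardware, tps in REPORTED_TOKENS_PER_SEC.items():
        wait = seconds_to_generate(response_tokens, tps)
        print(f"{hardware}: {wait:.2f} s for {response_tokens} tokens")
```

Under these assumptions, the same 500-token edit takes half a second on an H100 and a little over three seconds on DGX Spark, which is the difference between an edit that feels instant and one that feels like a request.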

Those numbers only apply in the right setting. Google says the speed advantage is strongest on dedicated accelerators running low-concurrency or single-user workloads. In cloud serving with many parallel requests, autoregressive models can already keep hardware busy, so DiffusionGemma can raise costs instead of cutting them.

That caveat is the whole story. DiffusionGemma is not “faster AI” in every deployment. It’s faster when the old decoding pattern leaves GPU compute underused. For teams wrestling with serving efficiency, that distinction echoes the GPU burn problem we covered in Ray Serve vs Triton: Pick Wrong and GPUs Burn Cash.


How does DiffusionGemma generate text from noise?

An autoregressive model generates text sequentially. Each new token depends on the tokens before it. The model cannot see words it has not produced yet, so the output moves left to right.

DiffusionGemma starts differently. It begins with a block of 256 random placeholder tokens and refines them across multiple passes until readable text emerges. The idea comes from diffusion image models, which start from noise and iteratively turn it into a coherent image.
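The refinement loop can be illustrated with a toy sketch. This is not DiffusionGemma's actual algorithm: the "denoiser" below is a stand-in that peeks at a known target, the `<mask>` placeholder replaces the random placeholder tokens described above, and the even commit schedule is an assumption made for readability.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token, for illustration only


def toy_denoiser(block, target, rng):
    """Stand-in for a real model: in one parallel pass, propose a token and a
    confidence for every still-masked position. It cheats by reading a known
    target, which a real model cannot do."""
    return {i: (target[i], rng.random()) for i, tok in enumerate(block) if tok == MASK}


def generate_block(target, num_steps=4, seed=0):
    """Refine an all-placeholder block over num_steps parallel passes."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    history = [list(block)]
    for step in range(num_steps):
        proposals = toy_denoiser(block, target, rng)
        if not proposals:
            break
        # Commit an even share of the remaining positions each pass,
        # highest-confidence first; the final pass commits whatever is left.
        steps_left = num_steps - step
        quota = -(-len(proposals) // steps_left)  # ceiling division
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:quota]:
            block[i] = proposals[i][0]
        history.append(list(block))
    return block, history


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog".split()
    _, passes = generate_block(words, num_steps=3)
    for n, state in enumerate(passes):
        print(n, " ".join(state))
```

Positions are filled in confidence order rather than left to right, so a late word can settle before an early one. As in the image-diffusion analogy, the block starts as pure placeholder and sharpens pass by pass.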

That comparison is useful, but text is less forgiving than pixels.

XOOMAR analysis: image diffusion can produce a plausible picture even if some local detail is imperfect. Text has stricter dependencies. Grammar, factual consistency, code syntax, markdown structure, and long-range logic all have to line up. A sentence can collapse because one token is wrong. A code block can fail because a bracket appears in the wrong place.

Google’s answer is bi-directional attention across the block. Because DiffusionGemma works on up to 256 tokens in parallel, each token can attend to tokens before and after it during generation. That is the technical reason Google sees promise in non-linear tasks such as in-line editing, code infilling, amino acid sequences, and mathematical graphs.
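The difference between the two attention patterns comes down to which positions each token is allowed to see. A minimal sketch, using plain boolean masks with hypothetical function names and a five-token block for readability:

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j.
    Autoregressive decoding only allows j <= i (the past and itself)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position in the block may attend to every other position."""
    return [[True] * n for _ in range(n)]


def visible_positions(mask, i):
    """List the positions that token i can attend to under the given mask."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


if __name__ == "__main__":
    n = 5  # toy block length; the article describes blocks of up to 256 tokens
    print("causal, token 2 sees:", visible_positions(causal_mask(n), 2))
    print("bidirectional, token 2 sees:", visible_positions(bidirectional_mask(n), 2))
```

Under the causal mask, token 2 sees only positions 0 through 2. Under the bidirectional mask it also sees positions 3 and 4, which is what lets an infilled token take the text after the gap into account.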

A 26B model that only activates 3.8B parameters per step

DiffusionGemma has 26 billion parameters in total, but it activates only 3.8 billion parameters per step. That comes from its Mixture-of-Experts architecture, where specialized sub-networks sit inside the model and only some are used for a given input.
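A toy version of expert routing shows how a model can hold many parameters while touching only a fraction of them per step. The top-k scheme, the four experts, and the router scores below are illustrative assumptions; the article does not describe DiffusionGemma's expert count or routing rule. Only the 26B and 3.8B figures come from the source.

```python
import math

TOTAL_PARAMS_B = 26.0   # total parameters in billions, from the article
ACTIVE_PARAMS_B = 3.8   # parameters activated per step in billions, from the article


def active_fraction(total_b, active_b):
    """Share of the model's parameters used in a single step."""
    return active_b / total_b


def top_k_route(router_scores, k):
    """Pick the k highest-scoring experts and return {expert_index: weight},
    with weights from a softmax over just those k scores."""
    top = sorted(range(len(router_scores)), key=lambda e: router_scores[e], reverse=True)[:k]
    peak = max(router_scores[e] for e in top)  # subtract the max for numerical stability
    exps = {e: math.exp(router_scores[e] - peak) for e in top}
    total = sum(exps.values())
    return {e: v / total for e, v in exps.items()}


if __name__ == "__main__":
    print(f"active share: {active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B):.1%}")
    # Hypothetical router scores for one token across four experts.
    print(top_k_route([0.1, 2.0, -1.0, 1.5], k=2))
```

With the article's figures, roughly 15% of the parameters do the work on any given step; the rest stay loaded but idle for that input.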

That design matters for local hardware. Google says the model can fit inside 18GB of VRAM on high-end consumer GPUs when quantized to lower precision. Nvidia has optimized it for RTX 5090, RTX 4090, Hopper, and Blackwell systems.
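A rough back-of-the-envelope check shows why quantization is what makes that fit possible. The bit widths below are assumptions, since the article does not say which precision Google used, and the estimate covers weights only, not activations or runtime overhead.

```python
TOTAL_PARAMS = 26e9  # total parameter count, from the article


def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    for bits in (16, 8, 4):  # assumed precisions, for comparison only
        print(f"{bits}-bit weights: {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")
```

At 16 bits the weights alone would need about 52GB and at 8 bits about 26GB; of the three, only 4-bit storage (about 13GB) leaves room under an 18GB budget. Note that the estimate uses all 26 billion parameters rather than the 3.8 billion active ones, on the assumption that every expert has to stay resident in memory even when it is not selected.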

Here is the practical split:

| Model style | Generation pattern | Hardware behavior in single-user mode | Main trade-off |
| --- | --- | --- | --- |
| Autoregressive LLMs | One token at a time | Often memory-bound | Higher quality in standard Gemma 4 |
| DiffusionGemma | Up to 256 tokens refined in parallel | Pushes work toward compute | Faster, but lower output quality |

Nvidia describes the old pattern as memory-bound: the GPU waits on memory bandwidth while compute units sit idle. DiffusionGemma changes the workload by processing many tokens at once, giving the GPU more math to do in parallel.
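A roofline-style estimate makes the memory-bound argument easier to see. Everything numeric here except the 3.8B active parameters and the 256-token block is an assumption: the accelerator figures are round hypothetical values rather than any vendor's specification, weights are assumed to be 16-bit, and the same weights are assumed to be read once per pass regardless of how many tokens are in flight (a simplification, especially for a Mixture-of-Experts model).

```python
ACTIVE_PARAMS = 3.8e9      # parameters used per step, from the article
BYTES_PER_PARAM = 2        # assumes 16-bit weights
PEAK_FLOPS = 1e15          # hypothetical accelerator: 1 PFLOP/s of compute
BANDWIDTH = 3e12           # hypothetical accelerator: 3 TB/s of memory bandwidth


def roofline_step(tokens_in_flight, active_params=ACTIVE_PARAMS,
                  bytes_per_param=BYTES_PER_PARAM,
                  peak_flops=PEAK_FLOPS, bandwidth=BANDWIDTH):
    """Lower-bound one forward pass as the slower of compute and memory time."""
    flops = 2 * active_params * tokens_in_flight    # about 2 FLOPs per weight per token
    bytes_moved = active_params * bytes_per_param   # weights read once per pass
    compute_time = flops / peak_flops
    memory_time = bytes_moved / bandwidth
    step_time = max(compute_time, memory_time)
    return {
        "bound": "compute" if compute_time >= memory_time else "memory",
        "compute_utilization": compute_time / step_time,
    }


if __name__ == "__main__":
    for tokens in (1, 256):
        result = roofline_step(tokens)
        print(tokens, result["bound"], f"{result['compute_utilization']:.1%}")
```

Under these assumptions, a single-token pass keeps the compute units busy well under 1% of the time, while a 256-token pass pushes that to roughly three quarters. This is not a throughput prediction, since a diffusion model needs several passes per block, but it shows where the idle compute comes from.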

That also explains why Google is not pitching this as a universal replacement. The architecture fits a specific performance gap: local, low-concurrency inference on dedicated GPUs.

Where rougher but faster text can still be useful

Google’s strongest examples are not polished long-form answers. They’re workflows where seeing the whole block at once gives the model an advantage.

Supported use cases include:

  • In-line editing: Revising or inserting text into an existing passage after the fact.
  • Code infilling: Filling missing sections of code where later context matters.
  • Structured data: Working with amino acid sequences or mathematical graphs.
  • Sudoku-style constraints: Google points to an Unsloth fine-tune where DiffusionGemma solves Sudoku, a task where entries depend on future entries.

The Sudoku example shows why left-to-right generation can be awkward. In a 9x9 Sudoku grid, one choice can depend on cells the model has not generated yet. Google’s example shows the base DiffusionGemma getting 31 cells wrong after 30 denoising steps, while the fine-tuned version solves the puzzle completely.
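The "cells wrong" measure and the all-at-once nature of the constraint can be sketched in a few lines. The function names and the example grid are hypothetical and are not taken from Google's or Unsloth's evaluation; the grid comes from a standard shifting pattern that is known to produce a valid solution.

```python
def is_valid_solution(grid):
    """True when every row, column, and 3x3 box contains the digits 1-9 once."""
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    boxes = [
        {grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in (0, 3, 6) for bc in (0, 3, 6)
    ]
    return all(group == digits for group in rows + cols + boxes)


def count_wrong_cells(candidate, solution):
    """Number of cells where the candidate grid differs from the reference."""
    return sum(
        a != b
        for cand_row, sol_row in zip(candidate, solution)
        for a, b in zip(cand_row, sol_row)
    )


def example_solution():
    """Hypothetical reference grid built from a shifting pattern."""
    return [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]


if __name__ == "__main__":
    solution = example_solution()
    attempt = [row[:] for row in solution]
    attempt[0][0], attempt[0][1] = attempt[0][1], attempt[0][0]  # swap two cells
    print(is_valid_solution(solution), is_valid_solution(attempt))
    print("cells wrong:", count_wrong_cells(attempt, solution))
```

Swapping two cells keeps the first row legal but breaks two columns and leaves two cells wrong, which is why a grid cannot be judged cell by cell in reading order: each entry is constrained by entries that come later.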

That doesn’t mean DiffusionGemma is ready for high-stakes user-facing answers. Google says standard Gemma 4 remains the choice when quality matters most. XOOMAR analysis: the safest near-term role is as a fast local draft, edit, or infill engine inside developer workflows where another system or a human can judge the result.

For coding teams, that puts DiffusionGemma closer to workflow infrastructure than chatbot replacement. Its code infilling angle sits near the developer-tool pressure points we discussed in Control Fight Splits Cursor vs Windsurf AI Coding Teams.


The quality gap is the real test

Speed is easy to market. Quality is harder to prove.

Google’s own positioning is careful: DiffusionGemma is experimental, and standard Gemma 4 remains the recommended model for high-quality production output. The Decoder reports that DiffusionGemma runs faster than autoregressive Gemma 4 models but scores lower on accuracy.

Developers can test the model through Hugging Face Transformers, vLLM, and MLX. For fine-tuning, Google points to Hackable Diffusion, Unsloth, and Nvidia NeMo Framework. Support for llama.cpp is planned.

The unresolved question is whether diffusion text models can close the quality gap without giving up the latency advantage. They need better factual reliability, stronger instruction following, more stable long-context behavior, and cleaner control over style.

For now, DiffusionGemma is best read as a serious testbed. If you’re building local interactive tools, code infilling systems, or non-linear text workflows, it’s worth testing. If you need trusted production answers, Google’s own guidance says to stay with standard Gemma 4 and watch whether diffusion-based text generation matures from a fast curiosity into a dependable engine.

The Bottom Line

  • DiffusionGemma could make local and interactive AI workflows feel much faster for developers.
  • Its lower quality means it is not yet a replacement for production-grade autoregressive LLMs.
  • Open weights and an Apache 2.0 license make it easier for researchers to test diffusion-based text generation.

DiffusionGemma vs. standard autoregressive Gemma models

| Model/Approach | Generation method | Speed/Performance | Quality/Use case |
| --- | --- | --- | --- |
| DiffusionGemma | Refines blocks of tokens in parallel through diffusion | About 1,000 tokens/sec on a single Nvidia H100; about 4x faster than comparable autoregressive models on dedicated GPUs | Lower output quality than Gemma 4; aimed at researchers and speed-critical developer workflows |
| Gemma 4 / autoregressive models | Generates text one token at a time | Slower in single-user dedicated GPU mode | Standard for high-quality production outputs |

Reported DiffusionGemma speeds by hardware

  • Nvidia H100: 1,000 tokens/sec
  • GeForce RTX 5090: 700 tokens/sec
  • DGX Spark: 150 tokens/sec