So What Are Text Diffusion Models?
What if you were to ask an LLM this question: "What does ice-cream taste like?"
How do you think it would go about answering this question? Here is ChatGPT's answer (constrained to be an ELI5, and constrained to be one short paragraph in length):
Imagine I’m telling a story one word at a time, peeking at everything I’ve already said to decide the very next word—that peek-and-pick habit is called autoregression. Each “word-chunk” I choose is a token, and predicting it is like picking the next Lego brick that best fits the tower I’ve already built. So if you ask, “What does ice-cream taste like?”, I first look at the words already on the page (“Ice-cream tastes…”) and then, token by token, guess the most likely next bite-sized piece (“like”, “a”, “cold”, “sweet”, “cloud”, etc.), stringing them together until the sentence feels finished. The magic is simply repeating that tiny guess-the-next-brick game really fast, so a full answer appears almost instantly.
You can (and should!) ask your friendly neighborhood LLM to explain this to you in greater detail, of course. But that's a good start to understanding how LLMs work.
But why do we use the term "LLM" when we are talking about AI? Because that is the type of AI we use when we use ChatGPT, Gemini, or Claude. These are all Large Language Models, and they're all predicting the next, most appropriate "word-chunk". They're all very good at it, they're all getting better, and there is some truly amazing work going on to make all of this faster, cheaper, and more efficient over time...
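The peek-and-pick loop ChatGPT describes above can be sketched in a few lines. The bigram table below is a hypothetical stand-in for a real model (an actual LLM scores every token in its vocabulary against the entire context, not just the last word), but the loop itself is the same: look at what's written so far, pick the most likely next token, append, repeat.

```python
# A toy sketch of autoregressive generation. The BIGRAMS table below is a
# hypothetical stand-in for a real model: an actual LLM scores every token
# in its vocabulary against the entire context, not just the last word.

BIGRAMS = {
    "<start>": "Ice-cream",
    "Ice-cream": "tastes",
    "tastes": "like",
    "like": "a",
    "a": "cold,",
    "cold,": "sweet",
    "sweet": "cloud",
}

def generate(max_tokens=20):
    tokens = ["<start>"]
    for _ in range(max_tokens):
        # Peek at what has been written so far, pick the most likely
        # next token, append it, and repeat.
        nxt = BIGRAMS.get(tokens[-1], "<end>")
        if nxt == "<end>":
            break
        tokens.append(nxt)
    return " ".join(tokens[1:])

print(generate())  # -> Ice-cream tastes like a cold, sweet cloud
```

Notice the path-dependence: once "tastes" is chosen, everything after it is conditioned on that choice, which is exactly why an early mistake can ripple forward through the whole answer.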
... but it ain't the only game in town.
What Are Diffusion Models?
What if, instead, the model began with "abcde, fghij, klmno, pqrst" as an answer to the question?
The question, remember, is "What does ice-cream taste like?"
Don't worry, the model doesn't stop there, and nor is this the answer that you see. In its own head (we're not optimizing for technical accuracy here, obviously, so ignore whatever questions you have about the phrasing), its first-pass answer to your question about the taste of ice-cream is "abcde, fghij, klmno, pqrst".
Now, the challenge, of course, is to figure out how to get from this obviously nonsensical answer to an answer which will satisfy the user.
So what the model does is it uses what it "knows" about ice-cream, and decides that a slightly less "noisy" answer might be the following one:
"abcde, sweet, klmno, pqrst"
Better, obviously, than the first answer, but still nowhere close to perfect. And so it goes through a third step, by which time we have:
"Ice-cream tastes sweet, klmno, creamy"
And so on and so forth, until you end up with an answer that might satisfy the user:
"Ice-cream tastes sweet, smooth and creamy"
(I said "might", remember, for those of you who are planning on writing letters to the editor about smooth and creamy not being tastes).
When you apply this technique in insanely more complicated ways (and there's plenty more where that came from), you get diffusion models.
Now, it is obviously way more complicated than that, and I probably got a lot of things wrong in that super-simplified explanation, but I hope you get the general gist of what a diffusion model is, and how it is different from an LLM.
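The denoising loop from the ice-cream example can be sketched the same way. Everything here is a toy: the "denoiser" below simply looks up the target sentence, whereas a real diffusion model predicts refinements from what it has learned. The shape of the loop is the point: the full sequence exists from step one, and each pass improves part of it, in no particular left-to-right order.

```python
import random

# A toy sketch of diffusion-style generation. The "denoiser" here is a
# hypothetical lookup that already knows the target sentence; a real model
# predicts refinements from learned statistics. The full sequence exists
# from step one, and each pass improves part of it.

TARGET = ["Ice-cream", "tastes", "sweet,", "smooth", "and", "creamy"]
rng = random.Random(0)

def denoise_step(seq, positions_per_step=2):
    # Pick a few still-noisy positions anywhere in the sequence and
    # replace them with better guesses.
    noisy = [i for i, tok in enumerate(seq) if tok == "<noise>"]
    for i in rng.sample(noisy, min(positions_per_step, len(noisy))):
        seq[i] = TARGET[i]
    return seq

seq = ["<noise>"] * len(TARGET)  # step 0: pure noise
while "<noise>" in seq:
    seq = denoise_step(seq)
    print(" ".join(seq))  # watch the sentence sharpen step by step
```

Contrast this with the autoregressive loop: there is no "next token" here, only a whole sentence that gets less noisy with every pass.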
So We'll All Be Switching to Diffusion Models?
Nope, not any time soon. Gemini has a diffusion model out (beta, invite only), but wider availability may take a while, and even then there is no guarantee that all of us will switch over.
Before we go any further, take a look at the differences between the two models, as per a Deep Research report that I asked Gemini to write:
Autoregressive (AR) Models: These models generate text in a strict, linear sequence, typically from left to right. The generation of each token is conditioned on the sequence of all previously generated tokens. This process is inherently sequential and path-dependent; an error made early in the sequence can propagate and influence all subsequent tokens. The model's task at each step is to predict the single most probable next token.
Diffusion Models: These models operate via holistic, iterative refinement. They begin with a low-fidelity representation of the entire output—essentially random noise—and generate the full block of text in parallel over multiple denoising steps. Instead of predicting the next word, the model's task at each step is to improve the quality and coherence of the entire sequence simultaneously. This is a global, rather than local, generation process.
Now, because the process with a diffusion model is more "holistic", they're typically better at maintaining fidelity. That sentence is dangerously close to management gobbledygook, so here's another attempt.
Remember how, in the ice-cream example, we spoke about the full nonsensical sentence, and how the full sentence was always under consideration while parts of it were "improved" over time? That's holistic. Holistic as opposed to an LLM's approach, which tries to predict the next word given what has been written so far. Under that approach, the end of the output is completely unknown to the model, so it is not holistic.
And because every step tries to improve the quality and coherence of the entire sequence as a whole, diffusion models maintain coherence and fidelity better than LLMs do.
There's a "But" Coming. I Can Feel It In My Bones.
Of course there is a but, and it is an obvious one. These things are expensive! Running a short prompt through a diffusion model can actually be cheaper (as best as I can tell, at any rate), but with even slightly longer queries, it can get very expensive, very quickly.
Besides, it is still very early days in the Text Diffusion world. Reasoning capability and accuracy are still a ways away, alignment and safety training techniques are designed for an LLM world, and diffusion models are compute-intensive.
Here's how o3 put it:
Think of a diffusion model as baking a whole cake and then shaving off the burnt bits until it looks perfect—fast per slice but you can’t hand anyone a taste until the whole cake’s ready. Today’s restaurants already own ovens (autoregressive stacks) that pop out cupcakes one by one; swapping kitchens, re-training chefs, and rewriting recipes costs more than the flour you save.
Bottom line: token pricing is only one line on the bill. Until diffusion models close their “first bite” latency, long-context compute, reasoning accuracy and alignment gaps—and until production tools catch up—they’ll remain exciting demos rather than the default engine behind everyday chatbots and enterprise workflows.
But (and if you are counting, this is a countering but, so you get two "buts" for the price of one) the possibilities are quite exciting. For one thing, you get two different approaches that both seem to work, and that's always a good thing. If what your project needs is "holistic", high-fidelity output, and costs be damned, great. At least you have an option.
Second, one can imagine cases where you first try to run a prompt through a diffusion model, and then run the output of that step through an LLM, and rinse and repeat in any order to suit your purpose. If quality is your goal, and costs be damned, maybe combining two approaches will give you better results.
Third, for specific use-cases, such as text interpolation, or "middle-filling", a diffusion model will likely work better than an LLM.
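Middle-filling is a natural fit for the denoising loop: clamp the known beginning and end, and refine only the gap, conditioning on both sides at once. An autoregressive model, by contrast, only ever sees the left context. Another toy sketch, with a hypothetical hard-coded "model prediction" standing in for the real thing:

```python
# A toy sketch of middle-filling with a diffusion-style loop. The prefix
# and suffix are clamped; only the masked gap is denoised. GAP_GUESS is a
# hypothetical model prediction, hard-coded here for illustration.

PREFIX = ["The", "librarian", "warned", "him."]
SUFFIX = ["He", "opened", "the", "book", "anyway."]
GAP_GUESS = ["His", "curiosity", "won", "out."]

def infill(prefix, suffix, gap_len):
    seq = prefix + ["<mask>"] * gap_len + suffix
    lo = len(prefix)
    for step in range(gap_len):
        # Each denoising step fills one masked position, while the model
        # attends to BOTH the prefix and the suffix (a real model refines
        # the whole gap in parallel over several steps).
        seq[lo + step] = GAP_GUESS[step]
    return " ".join(seq)

print(infill(PREFIX, SUFFIX, gap_len=4))
```

Because the suffix is part of the sequence being refined from the very first step, the filled-in middle is pushed toward something that actually leads into the ending, rather than merely continuing the beginning.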
Consider these two passages:
The old librarian slammed the ancient book shut, dust billowing into the air. "You must not read the final chapter," she warned, her voice trembling. "Some knowledge is not meant for mortals."
A shiver ran down Elias’s spine, but it was not from fear, but from a burgeoning excitement. He had sought this tome for years, whispered about in hushed tones among scholars as "The Codex Umbra." Each page, he knew, held secrets of the universe, and the final chapter, rumored to be the key to ultimate understanding, beckoned him. The librarian, sensing his resolve, pleaded, "Please, young man, reconsider. The price of such knowledge is too great." But Elias’s thirst for truth was unquenchable. He reached for the book, his fingers brushing against its worn leather cover.
He ignored her warning, his curiosity overpowering his fear. With trembling hands, he turned to the last page, and as his eyes scanned the forbidden text, a shadow fell over the room, and the candles extinguished one by one.
And here's the second one:
The old librarian slammed the ancient book shut, dust billowing into the air. 'You must not read the final chapter,' she warned, her voice trembling. 'Some knowledge is not meant for mortals.'
The young scholar scoffed, his eyes fixed on the worn cover. 'Nonsense,' he retorted, 'knowledge is power, and I fear nothing.' He leaned closer, the whispers of the book's contents seeming to echo in the silence of the library. The librarian pleaded, her face etched with ancient fear, but the allure of the forbidden text was too strong. He felt a pull, an irresistible urge to uncover the truth, believing it held the key to something profound, perhaps even dangerous.
He ignored her warning, his curiosity overpowering his fear. With trembling hands, he turned to the last page, and as his eyes scanned the forbidden text, a shadow fell over the room, and the candles extinguished one by one.
Which text, in your view, has "higher fidelity" and is more "holistic"? The second text was written by Gemini Diffusion.
And the prompt, in both cases, was this:
Here is the beginning and end of a scene. Fill in the middle in not more than 150 words, and what you write there must seamlessly connect the beginning with the end.
Beginning: The old librarian slammed the ancient book shut, dust billowing into the air. 'You must not read the final chapter,' she warned, her voice trembling. 'Some knowledge is not meant for mortals.'
End: He ignored her warning, his curiosity overpowering his fear. With trembling hands, he turned to the last page, and as his eyes scanned the forbidden text, a shadow fell over the room, and the candles extinguished one by one.
Early days, of course, but this is a development worth keeping an eye on, and if you are in the generating coherent text business, you may want to keep more than one eye out.
And it doesn't hurt, of course, to apply for a beta invite for Gemini Diffusion. It is a fun little tool to try out.

