Question 1

How does a diffusion model generate an image?

Accepted Answer

A diffusion model starts from pure random noise and removes it gradually over many steps. At each step the model predicts the noise still present in the image and subtracts part of it. After enough denoising steps, the random static resolves into a coherent image. The model learns to do this by being trained to reverse a process that slowly adds noise to real images.
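The loop described above can be sketched in a few lines of Python. This is a toy illustration, not a trained model. `oracle_noise_predictor` is a hypothetical stand-in for the neural network, and it is only exact when every training image equals one known `target`. The linear noise schedule and the DDPM-style update rule are also assumptions; the answer above does not name a specific sampler.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    # Assumed linear noise schedule; real models may use other schedules.
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # fraction of signal left after t steps
    return betas, alphas, alpha_bars

def oracle_noise_predictor(x_t, t, alpha_bars, target):
    # Hypothetical stand-in for the trained network. It returns the exact
    # noise only because we pretend the data set is the single image `target`.
    return (x_t - np.sqrt(alpha_bars[t]) * target) / np.sqrt(1.0 - alpha_bars[t])

def sample(target, num_steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(target.shape)  # start from pure noise
    for t in reversed(range(num_steps)):
        eps_hat = oracle_noise_predictor(x, t, alpha_bars, target)
        # Remove a little of the predicted noise (DDPM-style mean update).
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # Re-inject a small amount of fresh noise on all but the last step.
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(target.shape)
        else:
            x = mean
    return x

target = np.array([[0.0, 1.0], [1.0, 0.0]])  # a tiny 2x2 "image"
image = sample(target)
```

With the oracle predictor, the final step lands on `target` up to floating-point error. A real model replaces the oracle with a network trained on many images, so the result is a new image rather than a copy.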

Question 2

What are forward and reverse diffusion?

Accepted Answer

Forward diffusion is the training-time process, modelled as a Markov chain, of gradually adding small amounts of Gaussian noise to a real image until it becomes pure noise. Reverse diffusion is the generation process: the neural network learns to undo each noising step, starting from noise and denoising back toward a clean image. The model is trained to predict the noise that was added at each step.
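A minimal sketch of the forward process and the training target follows. It assumes a linear variance schedule and the standard closed-form shortcut for jumping straight to step `t`; the function names are illustrative, and `predicted_noise` is a placeholder where a real network's output would go.

```python
import numpy as np

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    # Assumed schedule: the per-step noise variance grows linearly.
    return np.linspace(beta_start, beta_end, num_steps)

def forward_step(x_prev, beta_t, rng):
    # One Markov step: shrink the signal slightly and add Gaussian noise.
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

def noise_to_step(x0, t, alpha_bars, rng):
    # Closed-form shortcut: sample x_t directly from x0 without looping.
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return x_t, noise

rng = np.random.default_rng(0)
betas = linear_beta_schedule()
alpha_bars = np.cumprod(1.0 - betas)  # cumulative signal retention

# Run the full chain on a toy 8x8 "image" of ones.
x0 = np.ones((8, 8))
x = x0.copy()
for beta_t in betas:
    x = forward_step(x, beta_t, rng)

# Training target: the network sees x_t and t and should output `true_noise`.
x_t, true_noise = noise_to_step(x0, t=500, alpha_bars=alpha_bars, rng=rng)
predicted_noise = np.zeros_like(true_noise)  # placeholder for a network output
loss = np.mean((predicted_noise - true_noise) ** 2)  # mean squared error
```

After the last step almost none of the original signal remains (`alpha_bars[-1]` is close to zero), which is why generation can start from pure noise.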

Question 3

How does text guide what a diffusion model creates?

Accepted Answer

A text encoder, often a model like CLIP, turns the prompt into embeddings that capture the meaning of the words. These embeddings are fed into the denoising network as a condition, so at every step the model removes noise in a direction consistent with the prompt. This text conditioning is what lets you type a description and get a matching image.
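One common way to feed embeddings into the denoising network is cross-attention, where image features attend to the text tokens. The sketch below assumes that mechanism; the answer above does not say how the condition is injected. The random arrays stand in for real encoder outputs and learned weights, and all names and dimensions are illustrative.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Numerically stable softmax.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=axis, keepdims=True)

def cross_attention(image_features, text_embeddings, w_q, w_k, w_v):
    # Queries come from the image; keys and values come from the text.
    queries = image_features @ w_q
    keys = text_embeddings @ w_k
    values = text_embeddings @ w_v
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    weights = softmax(scores, axis=-1)  # each patch's weighting over tokens
    return weights @ values, weights

rng = np.random.default_rng(0)
num_patches, image_dim = 16, 32  # hypothetical image feature grid
num_tokens, text_dim = 5, 24     # hypothetical prompt length and width
attn_dim = 8

image_features = rng.standard_normal((num_patches, image_dim))
text_embeddings = rng.standard_normal((num_tokens, text_dim))  # stand-in for encoder output
w_q = rng.standard_normal((image_dim, attn_dim))  # stand-ins for learned projections
w_k = rng.standard_normal((text_dim, attn_dim))
w_v = rng.standard_normal((text_dim, attn_dim))

conditioned, weights = cross_attention(image_features, text_embeddings, w_q, w_k, w_v)
```

Each row of `weights` sums to 1 and describes how strongly one image patch draws on each prompt token, which is how the prompt can steer every denoising step.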

Question 4

Why is it called latent diffusion?

Accepted Answer

Running diffusion directly on full-resolution pixels is very expensive, because the denoising network must process every pixel at every step. Latent diffusion, used by Stable Diffusion, first compresses images into a smaller latent space with an autoencoder, runs the diffusion process there, then decodes the result back to pixels. This dramatically reduces compute and memory, which is what made high-resolution open-source image generation practical.
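A quick back-of-the-envelope calculation shows the scale of the saving. The defaults below, a downsampling factor of 8 and 4 latent channels, are assumptions matching the configuration commonly cited for the original Stable Diffusion autoencoder; other models may differ.

```python
def latent_shape(height, width, downsample_factor=8, latent_channels=4):
    # Shape of the compressed representation, as (channels, height, width).
    return (latent_channels, height // downsample_factor, width // downsample_factor)

def num_values(shape):
    # Total number of scalar values in an array of this shape.
    total = 1
    for dim in shape:
        total *= dim
    return total

pixel_shape = (3, 512, 512)      # RGB image at 512x512
latent = latent_shape(512, 512)  # (4, 64, 64) under the assumed defaults
ratio = num_values(pixel_shape) / num_values(latent)
print(f"{ratio:.0f}x fewer values per denoising step")  # prints "48x fewer values per denoising step"
```

Under these assumptions the denoiser works on 16,384 values instead of 786,432 at every step.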

What Is a Diffusion Model? How AI Generates Images From Noise

The core idea: turning noise into images

Forward diffusion: adding noise

Reverse diffusion: denoising

Text conditioning with CLIP

Latent diffusion and why it matters