How Does Stable Diffusion Work? Image Generation Explained

Latent diffusion, CLIP conditioning, and the U-Net denoiser—all decoded

Definition

Stable Diffusion is a text-to-image model based on latent diffusion. It generates pictures by starting from random noise and progressively denoising it over many steps, guided by a text prompt, until a coherent image appears. Its key efficiency trick is that the entire denoising process happens in a compressed latent space rather than on full-resolution pixels. That keeps compute and memory requirements low enough for consumer GPUs, whereas pixel-space diffusion at the same resolution is considerably more demanding.

The diffusion idea

Diffusion models are trained by taking real images and adding noise to them step by step until they become pure static. The model learns to reverse this process: given a noisy image and how much noise it contains, predict the noise so it can be subtracted. At generation time you start from nothing but noise and apply this learned denoiser repeatedly, each step revealing a little more structure, until a clean image emerges.
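As a rough illustration of the noising and denoising described above, here is a minimal sketch in plain Python. It assumes the common DDPM-style formulation with a linear noise schedule, which this article does not specify. The function names, the schedule values, and the four-number stand-in "image" are all illustrative. No model is trained here: the true noise is reused as the "prediction", purely to show that an accurate noise prediction is enough to recover the clean signal.

```python
import math
import random


def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, noise, alpha_bar):
    """Forward process: blend the clean signal with Gaussian noise."""
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * x + noise_scale * n for x, n in zip(x0, noise)]


def remove_predicted_noise(xt, predicted_noise, alpha_bar):
    """Invert add_noise, given a prediction of the noise that was added."""
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [(x - noise_scale * n) / signal_scale for x, n in zip(xt, predicted_noise)]


rng = random.Random(0)
alpha_bars = alpha_bar_schedule()

x0 = [0.5, -0.25, 1.0, 0.0]                # stand-in for a tiny "image"
noise = [rng.gauss(0.0, 1.0) for _ in x0]  # the noise added during training

t = 500
xt = add_noise(x0, noise, alpha_bars[t])

# A trained model would predict `noise` from (xt, t).
# Here the true noise is used in its place, so recovery is exact up to rounding.
recovered = remove_predicted_noise(xt, noise, alpha_bars[t])
```

The schedule shrinks toward zero as the step index grows, so at the last step almost none of the original signal remains. That is why generation can begin from pure noise.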

Working in latent space: the VAE

Running diffusion directly on pixels is expensive, because the denoiser must process every value in the image at every step. Stable Diffusion sidesteps this with a variational autoencoder (VAE). The VAE’s encoder compresses a high-resolution image into a much smaller latent representation, the diffusion process operates entirely on those latents, and at the end the VAE’s decoder reconstructs the latents back into a full-resolution image. This compression is the single biggest reason Stable Diffusion is fast and accessible.
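To put a number on that compression, the sketch below does the arithmetic for one concrete case. It assumes the figures used by Stable Diffusion v1 (8x spatial downsampling and 4 latent channels), which this article does not state and which may differ in other versions. The helper names are made up for illustration.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (channels, height, width) of the latent for a given image size.

    The defaults are the assumed Stable Diffusion v1 values.
    """
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsample factor")
    return (latent_channels, height // downsample, width // downsample)


def compression_ratio(height, width, image_channels=3, downsample=8, latent_channels=4):
    """How many times fewer values the latent holds than the RGB image."""
    c, h, w = latent_shape(height, width, downsample, latent_channels)
    return (image_channels * height * width) / (c * h * w)


shape_512 = latent_shape(512, 512)        # (4, 64, 64)
ratio_512 = compression_ratio(512, 512)   # 48.0
```

Under these assumptions, a 512x512 RGB image holds 786,432 values while its latent holds 16,384, so each denoising step works on 48 times fewer numbers.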

Conditioning on text with CLIP

To make the output match your prompt, Stable Diffusion uses a CLIP text encoder. CLIP turns your prompt into a sequence of embeddings, one vector per token, that captures its meaning. These embeddings are injected into the denoiser at every step through a mechanism called cross-attention, in which each position in the latent looks up the prompt tokens most relevant to it. This conditioning steers each denoising step toward latents that, once decoded, depict what you described.
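The cross-attention mechanism can be sketched in a few lines of plain Python. This is a simplified illustration and not the model's actual layer: real implementations apply learned linear projections to form queries, keys, and values and use several attention heads, all of which are omitted here. The toy vectors and function names are invented for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention.

    queries: one vector per latent position
    keys, values: one vector per text token
    Returns (outputs, weights), with one row per latent position.
    """
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Each output is a weighted mix of the text-token value vectors.
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy "text token" embeddings and two toy "latent position" features.
text_tokens = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
latent_positions = [
    [4.0, 0.0, 0.0, 0.0],  # aligned with token 0
    [0.0, 0.0, 4.0, 0.0],  # aligned with token 2
]

attended, attn_weights = cross_attention(latent_positions, text_tokens, text_tokens)
```

Each latent position ends up with a mix of token vectors weighted toward the tokens it matches best, which is how different regions of the image can respond to different words in the prompt.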

The U-Net and classifier-free guidance

The denoiser itself is a U-Net, a neural network that takes the noisy latent, the current timestep, and the text embedding, and predicts the noise to remove. To strengthen prompt adherence, Stable Diffusion uses classifier-free guidance: at each step it runs the U-Net twice, once with the prompt and once with an empty prompt, then extrapolates from the unconditional prediction past the conditional one. The guidance scale controls how far that extrapolation goes. Higher values follow your text more literally but reduce diversity, while lower values produce looser, more varied interpretations. Together, latent diffusion, CLIP conditioning, the U-Net, and guidance turn a string of text into a detailed image, typically within seconds on a modern GPU.
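The guidance step itself is a one-line formula. The sketch below uses the standard form of classifier-free guidance, unconditional prediction plus the scale times the difference. The two noise predictions are hard-coded toy values standing in for U-Net outputs, and the function name is illustrative.

```python
def classifier_free_guidance(noise_uncond, noise_cond, guidance_scale):
    """Combine two noise predictions: uncond + scale * (cond - uncond)."""
    return [
        u + guidance_scale * (c - u)
        for u, c in zip(noise_uncond, noise_cond)
    ]


# Toy stand-ins for the U-Net's two predictions at one denoising step.
noise_without_prompt = [0.0, 1.0]
noise_with_prompt = [1.0, 1.0]

unguided = classifier_free_guidance(noise_without_prompt, noise_with_prompt, 0.0)
plain = classifier_free_guidance(noise_without_prompt, noise_with_prompt, 1.0)
guided = classifier_free_guidance(noise_without_prompt, noise_with_prompt, 7.5)
```

A scale of 0 ignores the prompt, a scale of 1 reproduces the prompted prediction unchanged, and larger values push further in the direction the prompt suggests. The value 7.5 is a commonly used default in popular tooling, not a figure taken from this article.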
