Question 1

How does Stable Diffusion generate images?

Accepted Answer

Stable Diffusion starts from pure random noise and gradually removes it over many steps (typically a few dozen), guided by your text prompt, until a coherent image emerges. Crucially, it performs this denoising in a compressed latent space rather than on full-resolution pixels, and only decodes the result to pixels at the end. That is what makes it fast enough to run on consumer GPUs.
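To make the "start from noise, remove a little each step" loop concrete, here is a minimal toy sketch in plain Python. It is not the real sampler: `toy_denoiser` and `target` are hypothetical stand-ins for the trained network and for whatever the prompt steers it toward, and real samplers use a learned model and a noise schedule rather than this simple fraction.

```python
import random


def toy_denoiser(x, target):
    """Hypothetical stand-in for the trained network.

    A real denoiser predicts the noise from the noisy latent, the step
    number, and the prompt. Here we cheat and compute it from a known target.
    """
    return [xi - ti for xi, ti in zip(x, target)]


def sample(target, steps=50, seed=0):
    rng = random.Random(seed)
    # Start from pure Gaussian noise, one value per "latent" element.
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for step in range(steps):
        predicted_noise = toy_denoiser(x, target)
        remaining = steps - step
        # Remove a fraction of the estimated noise; the fraction grows
        # until the final step removes whatever is left.
        x = [xi - ni / remaining for xi, ni in zip(x, predicted_noise)]
    return x


if __name__ == "__main__":
    print(sample([0.5, -1.0, 2.0]))
```

The point of the sketch is only the structure: one noisy starting point, a repeated "estimate the noise, subtract some of it" step, and a clean result at the end.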

Question 2

What is latent diffusion?

Accepted Answer

Latent diffusion runs the diffusion process in a compressed representation defined by a variational autoencoder (VAE) instead of on raw pixels. Working in this much smaller latent space dramatically reduces compute and memory while preserving enough detail for the VAE's decoder to reconstruct a high-quality image at the end.
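A quick calculation shows how much smaller the latent space is. The sketch below assumes the Stable Diffusion v1 layout (8x spatial downsampling and 4 latent channels); other model versions may use different values, and the function names are illustrative only.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Return (channels, height, width) of the latent for a given image size.

    Defaults assume the Stable Diffusion v1 VAE: 8x downsampling, 4 channels.
    """
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsample factor")
    return (latent_channels, height // downsample, width // downsample)


def compression_ratio(height, width, image_channels=3, downsample=8, latent_channels=4):
    """How many times fewer values the latent holds than the RGB image."""
    c, h, w = latent_shape(height, width, downsample, latent_channels)
    return (image_channels * height * width) / (c * h * w)


if __name__ == "__main__":
    print(latent_shape(512, 512))       # (4, 64, 64)
    print(compression_ratio(512, 512))  # 48.0
```

Under these assumptions, a 512x512 RGB image (786,432 values) is denoised as a 4x64x64 latent (16,384 values), so each step works on 48 times fewer numbers.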

Question 3

What does CLIP do in Stable Diffusion?

Accepted Answer

CLIP is a model trained to match images with text; Stable Diffusion uses its text encoder to convert your prompt into numerical embeddings the model can use. These embeddings condition the U-Net denoiser at every step through cross-attention, steering the noise removal toward an image that matches the meaning of your words.
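In Stable Diffusion, the text embeddings reach the U-Net through cross-attention: image-side features act as queries, and the prompt's token embeddings supply the keys and values. The toy sketch below shows that mechanism in plain Python. It is an assumption-laden simplification: it omits the learned projection matrices and multiple heads that real layers have, and all names and numbers are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention from image features to text tokens.

    queries: one vector per image location
    keys, values: one vector per text token (no learned projections here)
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)  # how much this location attends to each token
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs


if __name__ == "__main__":
    image_features = [[1.0, 0.0], [0.0, 1.0]]  # two image locations
    token_keys = [[1.0, 0.0], [0.0, 1.0]]      # two prompt tokens
    token_values = [[10.0, 0.0], [0.0, 10.0]]
    print(cross_attention(image_features, token_keys, token_values))
```

Each image location ends up with a weighted mix of the token values, weighted most heavily toward the tokens it matches best. That is how the prompt influences every denoising step.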

Question 4

What is classifier-free guidance?

Accepted Answer

Classifier-free guidance runs the denoiser twice per step, once with the text prompt and once without it, then amplifies the difference between the two noise predictions. A higher guidance scale makes the image follow the prompt more literally at the cost of diversity; a lower scale gives looser, more varied interpretations.
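The "amplify the difference" step is a one-line formula: guided = unconditional + scale * (conditional - unconditional). Here is a minimal sketch on plain Python lists; the function name and the sample numbers are illustrative, and a real pipeline applies the same arithmetic to tensors of predicted noise.

```python
def apply_guidance(noise_uncond, noise_cond, guidance_scale):
    """Combine the two noise predictions from one denoising step.

    noise_uncond: prediction made without the prompt
    noise_cond:   prediction made with the prompt
    """
    return [
        u + guidance_scale * (c - u)
        for u, c in zip(noise_uncond, noise_cond)
    ]


if __name__ == "__main__":
    uncond = [0.0, 1.0]
    cond = [1.0, 3.0]
    print(apply_guidance(uncond, cond, 1.0))  # [1.0, 3.0]
    print(apply_guidance(uncond, cond, 2.0))  # [2.0, 5.0]
```

A scale of 1 simply returns the prompt-conditioned prediction, 0 ignores the prompt entirely, and values above 1 push the result further in the direction the prompt suggests.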

How Does Stable Diffusion Work? Image Generation Explained

Definition

The diffusion idea

Working in latent space: the VAE

Conditioning on text with CLIP

The U-Net and classifier-free guidance