The core idea: turning noise into images
A diffusion model is the kind of AI behind text-to-image systems like Stable Diffusion, DALL-E, and Midjourney. Its central trick is counterintuitive: it learns to generate images by learning to remove noise. Generation starts not from a blank canvas but from a field of pure random static, and the model progressively cleans that static up over dozens of small steps until a coherent picture emerges. The model can do this because of how it was trained: real images were deliberately dissolved into noise, and the model learned to undo that damage one step at a time. Learn to reverse that dissolution and you can manufacture new images out of nothing but randomness and a prompt.
Forward diffusion: adding noise
The training process begins with forward diffusion. Take a real image and, step by step, add a small amount of Gaussian noise. Repeat this many times and the image steadily degrades until, after enough steps, it is practically indistinguishable from pure random noise. This sequence is a Markov chain: each noisy version depends only on the one immediately before it. Crucially, the amount of noise added at each step follows a fixed, known schedule, and the exact noise added to any training image is known too. That gives the model a clean supervised signal: show it a noisy image and the step number, and the correct answer (the noise that was added) is already available to compare against its prediction.
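The forward process described above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular library's implementation: the linear noise schedule, the step count of 1000, and the function names (make_schedule, forward_step, noise_to_step) are assumptions chosen for the example, and an 8x8 random array stands in for a real image.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule (an assumed, commonly used choice)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    # alpha_bars[t] is the fraction of the original signal variance left after step t.
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def forward_step(x_prev, beta_t, rng):
    """One Markov step: shrink the image slightly and add a little Gaussian noise."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

def noise_to_step(x0, t, alpha_bars, rng):
    """Jump straight to step t (0-indexed) in closed form and return the noise used."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
betas, alpha_bars = make_schedule()
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))  # stand-in for an image scaled to [-1, 1]

# Walk the whole chain one step at a time: the image ends up as near-pure noise.
x = x0
for beta_t in betas:
    x = forward_step(x, beta_t, rng)

# Training pairs are usually built with the closed-form jump instead:
# (x_t, t) is the network input and eps is the target it should predict.
x_t, eps = noise_to_step(x0, 500, alpha_bars, rng)
print(f"signal fraction left at the final step: {alpha_bars[-1]:.2e}")
```

Because each step is Gaussian, the step-by-step loop and the closed-form jump describe the same distribution, which is why training can pick a random step for each image without simulating the whole chain.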
Reverse diffusion: denoising
Generation runs the chain backwards. Reverse diffusion starts from pure noise and asks a neural network, typically a U-Net, to predict the noise present at the current step. A portion of that predicted noise is then removed, nudging the sample slightly closer to a clean image. Repeat across all the steps and the noise resolves into a realistic picture. The network learned this prediction during training, on real images that had been noised by forward diffusion, so at inference it can denoise samples it has never seen. The number of denoising steps trades speed against quality: fewer steps are faster but tend to give coarser results, while more steps are slower and usually sharper, with diminishing returns.
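The denoising loop can be sketched as follows, using a DDPM-style update rule as one common choice of sampler. Everything named here is an assumption for illustration: the short 50-step schedule, the function names, and above all make_toy_predictor, a hypothetical stand-in for a trained U-Net that cheats by knowing the single target image. A real network has to learn this prediction from data.

```python
import numpy as np

def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.2):
    """Short linear schedule, assumed here only to keep the toy example fast."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)

def make_toy_predictor(target, alpha_bars):
    """Hypothetical stand-in for a trained U-Net.

    If the training set were the single image `target`, the ideal noise
    prediction has this closed form. It exists only to make the loop runnable.
    """
    def predict_noise(x_t, t):
        return (x_t - np.sqrt(alpha_bars[t]) * target) / np.sqrt(1.0 - alpha_bars[t])
    return predict_noise

def sample(predict_noise, shape, betas, alphas, alpha_bars, rng):
    """DDPM-style ancestral sampling: start from noise, denoise step by step."""
    x = rng.standard_normal(shape)  # pure random static
    for t in reversed(range(len(betas))):
        eps_hat = predict_noise(x, t)
        # Remove the predicted noise contribution for this step.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # Every step except the last re-injects a little fresh noise.
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

rng = np.random.default_rng(0)
betas, alphas, alpha_bars = make_schedule()
target = rng.uniform(-1.0, 1.0, size=(8, 8))  # the one "image" the toy predictor knows
predictor = make_toy_predictor(target, alpha_bars)
result = sample(predictor, target.shape, betas, alphas, alpha_bars, rng)
print("max abs difference from target:", np.abs(result - target).max())
```

With this idealised predictor the loop lands back on the target image. With a real trained network the same loop instead produces new images that resemble the training distribution.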
Text conditioning with CLIP
A plain diffusion model would produce random plausible images; what makes generation controllable is conditioning. Your text prompt is converted into an embedding, frequently by the text encoder of a model such as CLIP, which was trained to place matching images and captions close together in a shared space. That embedding is injected into the denoising network (via cross-attention) so that at every step the noise is removed in a direction consistent with the prompt’s meaning. This is why typing “a watercolour fox in a snowy forest” steers the emerging image toward that description rather than something arbitrary.
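The cross-attention step mentioned above can be illustrated with a single attention head in NumPy. This is a sketch under stated assumptions: the sizes (64 image positions, 7 prompt tokens), the random arrays standing in for image features and text embeddings, and the random projection matrices w_q, w_k, w_v are all made up for the example. In a real model the projections are learned and the text embedding comes from an encoder such as CLIP's.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_feats, text_emb, w_q, w_k, w_v):
    """Single-head cross-attention: image positions query the prompt tokens."""
    q = image_feats @ w_q  # (num_positions, d_head), queries come from the image
    k = text_emb @ w_k     # (num_tokens, d_head), keys come from the text
    v = text_emb @ w_v     # (num_tokens, d_head), values come from the text
    # Each image position gets a probability distribution over prompt tokens.
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    # The output mixes text information into every image position.
    return weights @ v, weights

rng = np.random.default_rng(0)
image_feats = rng.standard_normal((64, 32))  # e.g. an 8x8 feature map, flattened
text_emb = rng.standard_normal((7, 16))      # stand-in for 7 prompt-token embeddings
w_q = rng.standard_normal((32, 24))
w_k = rng.standard_normal((16, 24))
w_v = rng.standard_normal((16, 24))

out, weights = cross_attention(image_feats, text_emb, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

The key design point is the asymmetry: queries come from the image being denoised while keys and values come from the prompt, so each region of the image can draw on whichever words are most relevant to it.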
Latent diffusion and why it matters
Running diffusion directly on full-resolution images, each with hundreds of thousands or millions of pixels, is computationally brutal. Latent diffusion, the approach popularised by Stable Diffusion, solves this by first compressing images into a much smaller latent space using an autoencoder, performing the entire noising-and-denoising process in that compact space, and only decoding back to full-resolution pixels at the end. Because the heavy diffusion work happens on small latent representations rather than raw images, it needs far less memory and compute. This optimisation is a large part of what brought high-quality image generation out of large research labs and onto consumer GPUs, fuelling the explosion of open-source text-to-image tools.
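A quick back-of-the-envelope calculation shows the scale of the saving. The figures below are assumptions based on the configuration commonly cited for Stable Diffusion v1 (a 512x512 RGB image, an autoencoder that downsamples each side by 8, and 4 latent channels); other models use different values, and the helper names are made up for the example.

```python
def num_values(height, width, channels):
    """Total count of numbers in a height x width x channels array."""
    return height * width * channels

def latent_shape(height, width, downsample=8, latent_channels=4):
    """Latent grid size for an assumed 8x spatial downsampling autoencoder."""
    return height // downsample, width // downsample, latent_channels

pixel_count = num_values(512, 512, 3)         # values in the RGB image
lat_h, lat_w, lat_c = latent_shape(512, 512)  # 64 x 64 x 4 under these assumptions
latent_count = num_values(lat_h, lat_w, lat_c)
reduction = pixel_count / latent_count

print(pixel_count, latent_count, reduction)   # 786432 16384 48.0
```

Under these assumptions the denoising network handles 48 times fewer values per step than it would in pixel space, and that saving repeats at every one of the denoising steps.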