Diffusion Models: From Noise to Art

Diffusion models power image generators like Stable Diffusion, Midjourney, and DALL·E. They work by learning to reverse a gradual noising process, a surprisingly elegant idea with roots in non-equilibrium thermodynamics.

The Forward Process: Adding Noise

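Start with a clean image. Iteratively add small amounts of Gaussian noise over hundreds of steps until the image is indistinguishable from pure random noise. This is the forward diffusion process. It is random (each step draws fresh noise) but fixed: the noise schedule is chosen in advance and nothing is learned, which keeps it simple.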
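The noising steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular library's implementation: the linear schedule and its endpoints (1e-4 to 0.02 over 1000 steps) are assumed values of the kind commonly used, and the function names are made up for this example. It also uses a handy shortcut: because every step adds Gaussian noise, you can jump straight from the clean image to step t without simulating the steps in between.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, rising linearly (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar[t]: fraction of the original signal variance left at step t."""
    alpha_bars = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample the noisy image at step t directly from the clean image x0."""
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal_scale * x + noise_scale * n for x, n in zip(x0, noise)]
    return xt, noise


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

rng = random.Random(0)
image = [0.5, -0.25, 1.0, -1.0]  # toy "image": four pixel values in [-1, 1]
slightly_noisy, _ = add_noise(image, 10, alpha_bars, rng)
almost_pure_noise, _ = add_noise(image, 999, alpha_bars, rng)
```

By the last step almost none of the original signal is left (alpha_bars[-1] is far below 0.001), which is why the end of the forward process looks like pure noise.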

The Reverse Process: Learning to Denoise

A neural network (typically a U‑Net) is trained to predict the noise that was added to an image, given the noisy image and the step number. Once trained, you can start from pure noise and iteratively "denoise" to generate a new image. This is the key insight: learn to reverse the diffusion process.
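To make the denoising loop concrete, here is a rough sketch of one common sampling rule (the DDPM-style update). Training a real U‑Net is out of scope, so the example swaps in an "oracle" noise predictor that is only exact when the training set is a single known image. That oracle, the schedule values, and all function names are assumptions made for illustration. What carries over to real models is the loop itself: start from noise, ask the predictor for the noise, remove a little of it, repeat.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (assumed values) plus cumulative signal fractions."""
    step = (beta_end - beta_start) / (num_steps - 1)
    betas = [beta_start + i * step for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars


def oracle_noise_predictor(target, alpha_bars):
    """Stand-in for a trained network: exact only for a one-image dataset."""
    def predict(xt, t):
        signal_scale = math.sqrt(alpha_bars[t])
        noise_scale = math.sqrt(1.0 - alpha_bars[t])
        return [(x - signal_scale * x0) / noise_scale for x, x0 in zip(xt, target)]
    return predict


def sample(predict_noise, size, betas, alpha_bars, rng):
    """Start from pure Gaussian noise and denoise one step at a time."""
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]
    for t in reversed(range(len(betas))):
        eps_hat = predict_noise(x, t)
        remove = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        rescale = 1.0 / math.sqrt(1.0 - betas[t])
        # Fresh noise is re-injected at every step except the last one.
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0
        x = [
            rescale * (xi - remove * ei) + sigma * rng.gauss(0.0, 1.0)
            for xi, ei in zip(x, eps_hat)
        ]
    return x


betas, alpha_bars = make_schedule(1000)
target = [0.5, -0.25, 1.0, -1.0]  # the only "image" the oracle knows
predictor = oracle_noise_predictor(target, alpha_bars)
generated = sample(predictor, len(target), betas, alpha_bars, random.Random(0))
```

With this oracle the loop lands back on the one image it knows. A trained network has seen many images, so the same loop produces a new image that resembles the training data instead.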

Conditioning: Making It Generate What You Want

By injecting text embeddings (from a text encoder such as CLIP or T5) into the U‑Net at each denoising step, usually through cross-attention layers, the model learns to generate images that match a text description. This is how "a cat wearing a spacesuit" becomes an image of exactly that rather than an arbitrary sample.
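In Stable Diffusion-style models the injection is typically done with cross-attention: each image position "looks at" the text embeddings and pulls in the ones most relevant to it. The toy version below is a sketch under heavy simplifications. It uses a single head, skips the learned query/key/value projections a real layer would have, and works on hand-written two-dimensional vectors; the function names are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_tokens, text_tokens):
    """Mix text information into each image token (no learned projections)."""
    dim = len(text_tokens[0])
    attended = []
    for query in image_tokens:
        # Similarity between this image position and every text token.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in text_tokens
        ]
        weights = softmax(scores)
        # Weighted average of the text tokens.
        mixed = [
            sum(w * value[i] for w, value in zip(weights, text_tokens))
            for i in range(dim)
        ]
        # Residual connection: keep the image feature, add the text signal.
        attended.append([q + m for q, m in zip(query, mixed)])
    return attended


text_tokens = [[1.0, 0.0], [0.0, 1.0]]                # stand-ins for two word embeddings
image_tokens = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]   # stand-ins for three image positions
conditioned = cross_attention(image_tokens, text_tokens)
```

The first image position is most similar to the first text token, so it absorbs mostly that token. The third position has no preference and receives an even blend of both.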

Latent Diffusion: The Efficiency Breakthrough

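Stable Diffusion's key innovation: instead of diffusing in pixel space (hundreds of thousands to millions of values per image), it diffuses in a much smaller latent space (for example, a 64×64×4 latent for a 512×512 image). A VAE encoder compresses images into that space, and its decoder turns the final latent back into pixels. This makes training and inference dramatically faster.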
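A quick back-of-the-envelope calculation shows how much smaller the latent is. The numbers below assume the setup used by the original Stable Diffusion releases (512×512 RGB images, 8× spatial downsampling, 4 latent channels); other models use different values, and the helper names are made up for this example.

```python
def tensor_size(height, width, channels):
    """Number of values in a height x width x channels tensor."""
    return height * width * channels


def latent_shape(height, width, downsample=8, latent_channels=4):
    """Latent dimensions for an image, given assumed VAE settings."""
    return height // downsample, width // downsample, latent_channels


pixel_values = tensor_size(512, 512, 3)
latent_height, latent_width, latent_channels = latent_shape(512, 512)
latent_values = tensor_size(latent_height, latent_width, latent_channels)
compression = pixel_values / latent_values

print(f"pixel space:  {pixel_values} values")
print(f"latent space: {latent_values} values "
      f"({latent_height}x{latent_width}x{latent_channels})")
print(f"the denoiser works on {compression:.0f}x fewer values")
```

Under these assumptions the denoising network handles 16,384 values per step instead of 786,432, a 48-fold reduction.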

Beyond Images

Diffusion models now generate video (Sora, Runway), audio (Stable Audio), 3D models, and even protein structures. The same principle—learn to reverse a corruption process—applies across modalities.
