Diffusion Models

Bit: Diffusion models literally learn to un-destroy images — start with pure noise, progressively denoise until an image emerges. It's like teaching AI to reverse entropy.


★ TL;DR

  • What: Generative models that create images by learning to reverse a gradual noise-addition process
  • Why: Surpassed GANs in image quality and stability. Power Stable Diffusion, DALL-E, Midjourney
  • Key point: Forward process = add noise step by step until pure noise. Reverse process = learn to denoise step by step.

★ Overview

Definition

Diffusion Models (specifically Denoising Diffusion Probabilistic Models — DDPMs) are generative models that learn to generate data by reversing a gradual noising process. Training teaches the model to denoise slightly noisy images at each step; generation starts from pure noise and iteratively denoises.

Scope

Covers diffusion model theory, architecture, and key models. For practical image generation tools, see individual model pages.

Significance

  • Replaced GANs as the dominant image generation architecture (2022+)
  • Power: Stable Diffusion, DALL-E 2/3, Midjourney, Imagen
  • Extended to: Video (Sora, Veo), Audio, 3D, and even molecular design
  • More stable training than GANs (no mode collapse)

Prerequisites

  • Transformers — the attention blocks inside the diffusion U-Net come from the Transformer architecture
  • Basic probability concepts

★ Deep Dive

The Core Idea

FORWARD PROCESS (Training — add noise):

  Clean Image → Slightly Noisy → More Noisy → ... → Pure Gaussian Noise
  x₀           x₁               x₂              xₜ

  Each step: xₜ = √(αₜ)·xₜ₋₁ + √(1-αₜ)·ε    (ε ~ Normal(0,1))

REVERSE PROCESS (Generation — remove noise):

  Pure Noise → Slightly Less Noisy → ... → Clean Image!
  xₜ          xₜ₋₁                    x₀

  The model learns: "Given noisy image xₜ, predict the noise ε"
  Then subtract that predicted noise to get xₜ₋₁

                      ┌──────── FORWARD (destroy) ────────┐
                      │                                    │
  [Clean Image] → [Noisy] → [Noisier] → ... → [Pure Noise]
  [Clean Image] ← [Noisy] ← [Noisier] ← ... ← [Pure Noise]
                      │                                    │
                      └──────── REVERSE (create) ─────────┘
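
A minimal PyTorch sketch of the forward process, using the standard closed-form jump from x₀ straight to xₜ (via the cumulative product of the αs); the linear schedule and tensor shapes here are illustrative assumptions, not a production setup.

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # simple linear schedule (illustrative)
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)      # ᾱ_t = ∏ α_s up to step t

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) in one shot: x_t = √ᾱ_t·x_0 + √(1-ᾱ_t)·ε."""
    eps = torch.randn_like(x0)
    return alphas_bar[t].sqrt() * x0 + (1 - alphas_bar[t]).sqrt() * eps

x0 = torch.rand(1, 3, 64, 64)                  # stand-in for a clean image
print(q_sample(x0, 10).std(), q_sample(x0, 999).std())  # larger t → closer to pure noise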

Architecture: U-Net + Attention

Diffusion Model Architecture:

  Noisy Image ──► [U-Net with Attention] ──► Predicted Noise
       +
  Time Step t  ──► (embedded and injected)
       +
  Text Prompt  ──► (cross-attention with text encoder output)

U-Net Structure:
  ┌───────────────────────────────────┐
  │ Encoder                           │
  │  ↓ Downsample + Attention blocks  │
  ├───────────────────────────────────┤
  │ Bottleneck (Attention)            │
  ├───────────────────────────────────┤
  │ Decoder                           │
  │  ↑ Upsample + Skip connections    │
  └───────────────────────────────────┘
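
The timestep t is usually turned into a sinusoidal embedding, run through a small MLP, and injected into every U-Net block. A hedged sketch of just the sinusoidal part (the dimension and frequency base are arbitrary choices for illustration):

import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int = 128) -> torch.Tensor:
    """Sinusoidal embedding of the diffusion step, in the style popularized by Transformers/DDPM."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half) / half)   # geometric frequency ladder
    args = t[:, None].float() * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)      # (batch, dim)

emb = timestep_embedding(torch.tensor([0, 500, 999]))
print(emb.shape)  # torch.Size([3, 128])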

Text-to-Image Pipeline (Stable Diffusion)

"A cat astronaut on Mars"
┌────────────┐    ┌───────────────┐    ┌──────────┐
│ Text       │    │ Diffusion     │    │ VAE      │
│ Encoder    │ →  │ U-Net         │ →  │ Decoder  │ → Image!
│ (CLIP)     │    │ (in latent    │    │ (latent  │
│            │    │  space, not   │    │  → pixel)│
└────────────┘    │  pixel space) │    └──────────┘
                  └───────────────┘
                  Runs ~20-50 denoising steps

Latent Diffusion (the key innovation in Stable Diffusion): Instead of denoising in pixel space (512×512×3 = huge), work in a compressed latent space (64×64×4 = much smaller). Massively reduces compute.
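
A quick back-of-envelope check of that saving, using the shapes above:

# Why latent diffusion is cheaper (SD 1.x shapes from the pipeline above)
pixel_space  = 512 * 512 * 3   # values per image in pixel space
latent_space = 64 * 64 * 4     # values per image in the VAE's latent space
print(pixel_space, latent_space, pixel_space / latent_space)  # 786432 16384 48.0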

Key Concepts

| Concept | Explanation |
|---|---|
| Noise Schedule | How quickly noise is added (linear, cosine, etc.). Affects image quality. |
| Guidance Scale (CFG) | How strongly to follow the text prompt. Higher = more prompt-adherent but less diverse. |
| Sampling Steps | Number of denoising steps. More = better quality, slower. 20-50 is typical. |
| Negative Prompt | What to avoid: "blurry, low quality, distorted" |
| Inpainting | Replace part of an image (mask + prompt for new content) |
| ControlNet | Add spatial control (pose, edges, depth maps) to guide generation |
| LoRA (for images) | Train small adapters to add specific styles/characters to Stable Diffusion |
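
Several of these knobs map directly onto parameters of the Hugging Face diffusers pipelines; a hedged sketch (model ID and values are illustrative only, requires a CUDA GPU):

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a cat astronaut on Mars, cinematic lighting",
    negative_prompt="blurry, low quality, distorted",  # steer away from common artifacts
    guidance_scale=7.5,                                # CFG: prompt adherence vs diversity
    num_inference_steps=30,                            # sampling steps: quality vs speed
).images[0]
image.save("sdxl_output.png")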

Major Models

| Model | Company | Key Feature |
|---|---|---|
| Stable Diffusion XL/3 | Stability AI | Open-source, customizable, LoRA ecosystem |
| DALL-E 3 | OpenAI | Integrated with ChatGPT, best prompt following |
| Midjourney v6+ | Midjourney | Highest aesthetic quality, artistic |
| Imagen 3 | Google | Google's best, integrated in Gemini |
| Flux | Black Forest Labs | Open, high quality, by ex-Stability AI team |

◆ Formulas & Equations

| Name | Formula | Variables | Use |
|---|---|---|---|
| Forward Process | $$q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{\alpha_t}\,x_{t-1}, (1-\alpha_t)I)$$ | αₜ = noise schedule, I = identity | Add noise at step t |
| Training Objective | $$L = \mathbb{E}_{t,x_0,\epsilon}\left[\lVert\epsilon - \epsilon_\theta(x_t, t)\rVert^2\right]$$ | ε = actual noise, ε_θ = predicted noise | Train the denoiser |
| Classifier-Free Guidance | $$\hat{\epsilon} = \epsilon_{\text{uncond}} + s(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})$$ | s = guidance scale (typically 7-15) | Control prompt adherence |
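
A hedged sketch of the training objective as one PyTorch step: sample a timestep, noise the image with the forward process, and regress the model's output onto the true noise (model is a stand-in for the U-Net):

import torch
import torch.nn.functional as F

def training_step(model, x0: torch.Tensor, alphas_bar: torch.Tensor) -> torch.Tensor:
    """One DDPM training step: L = E[ ||ε - ε_θ(x_t, t)||² ]."""
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_bar), (b,), device=x0.device)   # random timestep per sample
    eps = torch.randn_like(x0)                                      # the noise we will try to predict
    a_bar = alphas_bar[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps              # forward process q(x_t | x_0)
    eps_pred = model(x_t, t)                                        # U-Net predicts the noise
    return F.mse_loss(eps_pred, eps)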

◆ Comparison

| Aspect | Diffusion Models | GANs | VAEs |
|---|---|---|---|
| Training | Stable (MSE loss) | Unstable (adversarial) | Stable (ELBO) |
| Quality | Excellent | Excellent (when it works) | Blurry |
| Diversity | High (no mode collapse) | Mode collapse risk | High |
| Speed | Slow (many steps) | Fast (one forward pass) | Fast |
| Controllability | High (CFG, ControlNet) | Limited | Limited |
| Status (2026) | Dominant | Declining for images | Niche |

◆ Strengths vs Limitations

| ✅ Strengths | ❌ Limitations |
|---|---|
| Best image quality currently | Slow generation (20-50 steps) |
| Stable training (no mode collapse) | High compute for training |
| Highly controllable (CFG, ControlNet, LoRA) | Memory-intensive |
| Works in multiple domains (image, video, audio, 3D) | Theory is mathematically complex |
| Strong open-source ecosystem (Stable Diffusion) | Can reproduce copyrighted styles (legal issues) |

○ Interview Angles

  • Q: How do diffusion models generate images?
  • A: During training, the model learns to predict noise added to images at various levels. During generation, start from pure noise and iteratively denoise over many steps, guided by a text prompt using classifier-free guidance.
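
A hedged sketch of that loop as plain DDPM ancestral sampling (real pipelines use faster samplers such as DDIM or DPM-Solver; model is a stand-in for the trained noise predictor):

import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas: torch.Tensor) -> torch.Tensor:
    """Start from pure noise and iteratively denoise: x_T → x_{T-1} → ... → x_0."""
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                   # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        eps_pred = model(x, torch.full((shape[0],), t))      # predict the noise present in x_t
        mean = (x - betas[t] / (1 - alphas_bar[t]).sqrt() * eps_pred) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise                   # σ_t = √β_t variance choice
    return x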

  • Q: Why did diffusion models replace GANs?

  • A: More stable training (no adversarial min-max game), no mode collapse, higher diversity, better controllability (CFG, ControlNet), and the quality caught up and surpassed GANs.

  • Q: What is classifier-free guidance?

  • A: During training, randomly drop the text condition. At inference, run both conditional and unconditional passes, then amplify the difference. This lets you control how strongly the model follows the prompt (guidance scale parameter).
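
And a hedged sketch of classifier-free guidance at inference time (two forward passes; model and the conditioning tensors are placeholders):

def cfg_noise(model, x_t, t, text_emb, null_emb, guidance_scale: float = 7.5):
    """ε̂ = ε_uncond + s·(ε_cond - ε_uncond): amplify the direction the prompt pulls in."""
    eps_cond = model(x_t, t, text_emb)     # conditioned on the prompt embedding
    eps_uncond = model(x_t, t, null_emb)   # conditioned on the "empty prompt" embedding
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)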

★ Code & Implementation

Image Generation with DALL-E 3 / Stable Diffusion

# pip install openai>=1.60 diffusers>=0.27 torch>=2.3
# ⚠️ Last tested: 2026-04 | DALL-E requires: openai>=1.60, OPENAI_API_KEY
#                        | SD requires: diffusers>=0.27, GPU recommended

# ═══ Method 1: DALL-E 3 via OpenAI API ═══
from openai import OpenAI
client = OpenAI()

response = client.images.generate(
    model="dall-e-3",
    prompt="A photorealistic transformer robot made of glowing neural connections, dramatic lighting",
    size="1024x1024",
    quality="standard",
    n=1,
)
image_url = response.data[0].url
print(f"DALL-E 3 image URL: {image_url}")
# Note: URL expires after ~1 hour; download immediately

# ═══ Method 2: Stable Diffusion (local, free) ═══
# Requires: CUDA GPU with 6GB+ VRAM
# from diffusers import StableDiffusionPipeline
# import torch
#
# pipe = StableDiffusionPipeline.from_pretrained(
#     "runwayml/stable-diffusion-v1-5",
#     torch_dtype=torch.float16,
# ).to("cuda")
#
# image = pipe(
#     "A photorealistic transformer robot",
#     num_inference_steps=20,  # quality vs speed tradeoff
#     guidance_scale=7.5,      # prompt adherence vs diversity
# ).images[0]
# image.save("output.png")

# ═══ Conceptual DDPM Noise Scheduling ═══
import torch
import math

def cosine_beta_schedule(timesteps: int = 1000) -> torch.Tensor:
    """Cosine noise schedule (Improved DDPM, Ho et al. 2022)."""
    steps = torch.arange(timesteps + 1, dtype=torch.float64)
    s     = 0.008  # small offset to prevent singularity at t=0
    alphas_bar = torch.cos(((steps / timesteps) + s) / (1 + s) * math.pi * 0.5) ** 2
    alphas_bar = alphas_bar / alphas_bar[0]
    betas      = 1 - (alphas_bar[1:] / alphas_bar[:-1])
    return betas.clamp(0, 0.999)

betas = cosine_beta_schedule(1000)
print(f"Beta schedule: t=0 → {betas[0]:.6f}, t=500 → {betas[500]:.4f}, t=999 → {betas[-1]:.4f}")
# Noise is added gradually — early steps add tiny noise, late steps add lots

★ Connections

| Relationship | Topics |
|---|---|
| Builds on | Transformers (attention in U-Net), VAEs (latent space) |
| Leads to | Video generation (Sora), 3D generation, Audio diffusion |
| Compare with | GANs (adversarial), VAEs (variational), Autoregressive image models |
| Cross-domain | Physics (thermodynamic diffusion), Stochastic processes |

◆ Production Failure Modes

| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Prompt-image misalignment | Generated image doesn't match text prompt | Model struggles with spatial relationships and counting | Negative prompts, prompt engineering, ControlNet guidance |
| NSFW content generation | Inappropriate content despite safety filters | Safety classifier misses novel attack vectors | Multi-layer safety classifiers, NSFW model fine-tuning |
| Inconsistency across generations | Same prompt produces wildly different styles | No seed management, no style conditioning | Fixed seeds for reproducibility, style-conditioned LoRA |
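
For the consistency row, a hedged example of pinning the seed with diffusers — the generator argument is the standard way to make a pipeline call reproducible (model ID and settings are illustrative; requires a CUDA GPU):

# import torch
# from diffusers import StableDiffusionPipeline
#
# pipe = StableDiffusionPipeline.from_pretrained(
#     "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
# ).to("cuda")
#
# gen = torch.Generator(device="cuda").manual_seed(1234)   # same seed + same settings → same image
# image = pipe("a cat astronaut on Mars", generator=gen, num_inference_steps=20).images[0]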

◆ Hands-On Exercises

Exercise 1: Compare Diffusion Architectures

Goal: Generate images with different diffusion models and compare quality
Time: 30 minutes
Steps:
  1. Use the same 5 text prompts across SDXL, DALL-E 3, and Midjourney
  2. Rate each output on prompt adherence, quality, and style
  3. Measure generation time per image
  4. Document cost per generation
Expected Output: Comparison grid with quality scores and generation metrics


◆ Resources

| Type | Resource | Why |
|---|---|---|
| 📄 Paper | Ho et al. "Denoising Diffusion Probabilistic Models" (2020) | Foundational diffusion model paper |
| 📄 Paper | Rombach et al. "Latent Diffusion Models" (2022) | Stable Diffusion architecture paper |
| 🎥 Video | Yannic Kilcher — "Diffusion Models" | Clear explanation of the diffusion process |
| 🔧 Hands-on | HuggingFace Diffusers Library | Production diffusion model library |

★ Sources

  • Ho et al., "Denoising Diffusion Probabilistic Models" (DDPM, 2020)
  • Rombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models" (2022) — Stable Diffusion paper
  • "The Illustrated Stable Diffusion" by Jay Alammar
  • Stability AI documentation — https://stability.ai