Diffusion Models¶
✨ Bit: Diffusion models literally learn to un-destroy images — start with pure noise, progressively denoise until an image emerges. It's like teaching AI to reverse entropy.
★ TL;DR¶
- What: Generative models that create images by learning to reverse a gradual noise-addition process
- Why: Surpassed GANs in image quality and stability. Power Stable Diffusion, DALL-E, Midjourney
- Key point: Forward process = add noise step by step until pure noise. Reverse process = learn to denoise step by step.
★ Overview¶
Definition¶
Diffusion Models (specifically Denoising Diffusion Probabilistic Models — DDPMs) are generative models that learn to generate data by reversing a gradual noising process. Training teaches the model to denoise slightly noisy images at each step; generation starts from pure noise and iteratively denoises.
Scope¶
Covers diffusion model theory, architecture, and key models. For practical image generation tools, see individual model pages.
Significance¶
- Replaced GANs as the dominant image generation architecture (2022+)
- Power: Stable Diffusion, DALL-E 2/3, Midjourney, Imagen
- Extended to: Video (Sora, Veo), Audio, 3D, and even molecular design
- More stable training than GANs (no mode collapse)
Prerequisites¶
- Transformers — attention blocks (and cross-attention) are used inside the diffusion U-Net
- Basic probability concepts
★ Deep Dive¶
The Core Idea¶
FORWARD PROCESS (Training — add noise):
Clean Image → Slightly Noisy → More Noisy → ... → Pure Gaussian Noise
x₀ x₁ x₂ xₜ
Each step: xₜ = √(αₜ)·xₜ₋₁ + √(1-αₜ)·ε (ε ~ Normal(0,1))
REVERSE PROCESS (Generation — remove noise):
Pure Noise → Slightly Less Noisy → ... → Clean Image!
xₜ xₜ₋₁ x₀
The model learns: "Given noisy image xₜ, predict the noise ε"
Then subtract that predicted noise to get xₜ₋₁
┌──────── FORWARD (destroy) ────────┐
│ │
[Clean Image] → [Noisy] → [Noisier] → ... → [Pure Noise]
[Clean Image] ← [Noisy] ← [Noisier] ← ... ← [Pure Noise]
│ │
└──────── REVERSE (create) ─────────┘
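The forward step above can be written directly in code. A minimal sketch (illustrative α value, not a tuned schedule) showing one noising step, and that if the noise ε were known exactly, the step could be inverted:

```python
import torch

torch.manual_seed(0)
x_prev = torch.randn(3, 8, 8)           # stand-in for x_{t-1} (channels, height, width)
alpha_t = 0.98                          # keep-signal fraction at this step
eps = torch.randn_like(x_prev)          # fresh Gaussian noise

# Forward step: x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * eps
x_t = (alpha_t ** 0.5) * x_prev + ((1 - alpha_t) ** 0.5) * eps

# Reverse direction: with the true eps, the step inverts exactly —
# the model's job is to *predict* eps, since it isn't known at generation time
x_recovered = (x_t - ((1 - alpha_t) ** 0.5) * eps) / (alpha_t ** 0.5)
print(torch.allclose(x_recovered, x_prev, atol=1e-5))  # True
```

In real sampling the model only has a prediction ε_θ, so the reverse step is approximate and repeated over many timesteps.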
Architecture: U-Net + Attention¶
Diffusion Model Architecture:
Noisy Image ──► [U-Net with Attention] ──► Predicted Noise
+
Time Step t ──► (embedded and injected)
+
Text Prompt ──► (cross-attention with text encoder output)
U-Net Structure:
┌───────────────────────────────────┐
│ Encoder │
│ ↓ Downsample + Attention blocks │
├───────────────────────────────────┤
│ Bottleneck (Attention) │
├───────────────────────────────────┤
│ Decoder │
│ ↑ Upsample + Skip connections │
└───────────────────────────────────┘
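The "Time Step t" input above is typically injected as a sinusoidal embedding (the same trick as Transformer positional encodings), then passed through an MLP. A minimal sketch of the embedding, assuming the standard DDPM formulation:

```python
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int = 128) -> torch.Tensor:
    """Sinusoidal embedding of diffusion timesteps (DDPM/Transformer style)."""
    half = dim // 2
    # Geometric frequency ladder from 1 down to 1/10000
    freqs = torch.exp(-math.log(10000) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]                    # (batch, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (batch, dim)

emb = timestep_embedding(torch.tensor([0, 500, 999]))
print(emb.shape)  # torch.Size([3, 128])
```

The resulting vector is usually added (after a small MLP) to the feature maps of each U-Net block, so every layer knows how noisy its input is.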
Text-to-Image Pipeline (Stable Diffusion)¶
"A cat astronaut on Mars"
│
▼
┌────────────┐ ┌───────────────┐ ┌──────────┐
│ Text │ │ Diffusion │ │ VAE │
│ Encoder │ → │ U-Net │ → │ Decoder │ → Image!
│ (CLIP) │ │ (in latent │ │ (latent │
│ │ │ space, not │ │ → pixel)│
└────────────┘ │ pixel space) │ └──────────┘
└───────────────┘
Runs ~20-50 denoising steps
Latent Diffusion (the key innovation in Stable Diffusion): Instead of denoising in pixel space (512×512×3 = huge), work in a compressed latent space (64×64×4 = much smaller). Massively reduces compute.
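A quick back-of-envelope check of that compression, using the dimensions quoted above:

```python
pixel = 512 * 512 * 3    # values per image in pixel space
latent = 64 * 64 * 4     # values per image in SD's latent space
print(pixel, latent, pixel / latent)  # 786432 16384 48.0
```

Every denoising step therefore touches ~48× fewer values, which is what makes running 20-50 steps on consumer GPUs feasible.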
Key Concepts¶
| Concept | Explanation |
|---|---|
| Noise Schedule | How quickly noise is added (linear, cosine, etc.). Affects image quality. |
| Guidance Scale (CFG) | How strongly to follow the text prompt. Higher = more prompt-adherent but less diverse. |
| Sampling Steps | Number of denoising steps. More = better quality, slower. 20-50 is typical. |
| Negative Prompt | What to avoid: "blurry, low quality, distorted" |
| Inpainting | Replace part of an image (mask + prompt for new content) |
| ControlNet | Add spatial control (pose, edges, depth maps) to guide generation |
| LoRA (for images) | Train small adapters to add specific styles/characters to Stable Diffusion |
Major Models¶
| Model | Company | Key Feature |
|---|---|---|
| Stable Diffusion XL/3 | Stability AI | Open-source, customizable, LoRA ecosystem |
| DALL-E 3 | OpenAI | Integrated with ChatGPT, best prompt following |
| Midjourney v6+ | Midjourney | Highest aesthetic quality, artistic |
| Imagen 3 | Google | Google's best, integrated in Gemini |
| Flux | Black Forest Labs | Open, high quality, by ex-Stability AI team |
◆ Formulas & Equations¶
| Name | Formula | Variables | Use |
|---|---|---|---|
| Forward Process | $$q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{\alpha_t}x_{t-1}, (1-\alpha_t)I)$$ | αₜ = noise schedule, I = identity | Add noise at step t |
| Training Objective | $$L = \mathbb{E}_{t,x_0,\epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$ | ε = actual noise, ε_θ = predicted noise | Train the denoiser |
| Classifier-Free Guidance | $$\hat{\epsilon} = \epsilon_{uncond} + s(\epsilon_{cond} - \epsilon_{uncond})$$ | s = guidance scale (typically 7-15) | Control prompt adherence |
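The classifier-free guidance formula maps directly to code: run the denoiser twice per step (with and without the prompt) and extrapolate along the conditional direction. A minimal sketch, using random tensors as stand-ins for the two model outputs:

```python
import torch

def cfg_noise(eps_cond: torch.Tensor, eps_uncond: torch.Tensor, s: float = 7.5) -> torch.Tensor:
    """Classifier-free guidance: amplify the (conditional - unconditional) direction by scale s."""
    return eps_uncond + s * (eps_cond - eps_uncond)

torch.manual_seed(0)
eps_c = torch.randn(4, 64)   # stand-in for eps_theta(x_t, t, prompt)
eps_u = torch.randn(4, 64)   # stand-in for eps_theta(x_t, t, empty prompt)

# Sanity check: s = 1 recovers the plain conditional prediction
assert torch.allclose(cfg_noise(eps_c, eps_u, s=1.0), eps_c)
guided = cfg_noise(eps_c, eps_u, s=7.5)
```

At s > 1 the prediction overshoots past the conditional estimate, which is why large guidance scales increase prompt adherence but reduce diversity.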
◆ Comparison¶
| Aspect | Diffusion Models | GANs | VAEs |
|---|---|---|---|
| Training | Stable (MSE loss) | Unstable (adversarial) | Stable (ELBO) |
| Quality | Excellent | Excellent (when it works) | Blurry |
| Diversity | High (no mode collapse) | Mode collapse risk | High |
| Speed | Slow (many steps) | Fast (one forward pass) | Fast |
| Controllability | High (CFG, ControlNet) | Limited | Limited |
| Status (2026) | Dominant | Declining for images | Niche |
◆ Strengths vs Limitations¶
| ✅ Strengths | ❌ Limitations |
|---|---|
| Best image quality currently | Slow generation (20-50 steps) |
| Stable training (no mode collapse) | High compute for training |
| Highly controllable (CFG, ControlNet, LoRA) | Memory-intensive |
| Works in multiple domains (image, video, audio, 3D) | Theory is mathematically complex |
| Strong open-source ecosystem (Stable Diffusion) | Can reproduce copyrighted styles (legal issues) |
○ Interview Angles¶
- Q: How do diffusion models generate images?
  A: During training, the model learns to predict the noise added to images at various noise levels. During generation, start from pure noise and iteratively denoise over many steps, guided by a text prompt via classifier-free guidance.
- Q: Why did diffusion models replace GANs?
  A: More stable training (no adversarial min-max game), no mode collapse, higher diversity, better controllability (CFG, ControlNet), and image quality that caught up with and then surpassed GANs.
- Q: What is classifier-free guidance?
  A: During training, randomly drop the text condition. At inference, run both conditional and unconditional passes, then amplify the difference between them. The guidance scale controls how strongly the model follows the prompt.
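The "randomly drop the text condition" part of that answer can be sketched as one masking step in the training loop. The shapes and names below are illustrative (CLIP-style 77×768 token embeddings), not from a specific library:

```python
import torch

def drop_condition(text_emb: torch.Tensor, null_emb: torch.Tensor, p_uncond: float = 0.1) -> torch.Tensor:
    """With probability p_uncond per sample, replace the text embedding with the null embedding."""
    mask = (torch.rand(text_emb.shape[0]) < p_uncond)[:, None, None]  # (batch, 1, 1)
    return torch.where(mask, null_emb, text_emb)                      # broadcasts over tokens/dims

text_emb = torch.randn(8, 77, 768)   # batch of prompt embeddings (illustrative shapes)
null_emb = torch.zeros(1, 77, 768)   # embedding of the empty prompt
mixed = drop_condition(text_emb, null_emb)
print(mixed.shape)  # torch.Size([8, 77, 768])
```

Training on this mix is what gives the model the unconditional branch that CFG extrapolates against at inference.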
★ Code & Implementation¶
Image Generation with DALL-E 3 / Stable Diffusion¶
# pip install openai>=1.60 diffusers>=0.27 torch>=2.3
# ⚠️ Last tested: 2026-04 | DALL-E requires: openai>=1.60, OPENAI_API_KEY
# | SD requires: diffusers>=0.27, GPU recommended
# ═══ Method 1: DALL-E 3 via OpenAI API ═══
from openai import OpenAI
client = OpenAI()
response = client.images.generate(
model="dall-e-3",
prompt="A photorealistic transformer robot made of glowing neural connections, dramatic lighting",
size="1024x1024",
quality="standard",
n=1,
)
image_url = response.data[0].url
print(f"DALL-E 3 image URL: {image_url}")
# Note: URL expires after ~1 hour; download immediately
# ═══ Method 2: Stable Diffusion (local, free) ═══
# Requires: CUDA GPU with 6GB+ VRAM
# from diffusers import StableDiffusionPipeline
# import torch
#
# pipe = StableDiffusionPipeline.from_pretrained(
# "runwayml/stable-diffusion-v1-5",
# torch_dtype=torch.float16,
# ).to("cuda")
#
# image = pipe(
# "A photorealistic transformer robot",
# num_inference_steps=20, # quality vs speed tradeoff
# guidance_scale=7.5, # prompt adherence vs diversity
# ).images[0]
# image.save("output.png")
# ═══ Conceptual DDPM Noise Scheduling ═══
import torch
import math
def cosine_beta_schedule(timesteps: int = 1000) -> torch.Tensor:
"""Cosine noise schedule (Improved DDPM, Ho et al. 2022)."""
steps = torch.arange(timesteps + 1, dtype=torch.float64)
s = 0.008 # small offset to prevent singularity at t=0
alphas_bar = torch.cos(((steps / timesteps) + s) / (1 + s) * math.pi * 0.5) ** 2
alphas_bar = alphas_bar / alphas_bar[0]
betas = 1 - (alphas_bar[1:] / alphas_bar[:-1])
return betas.clamp(0, 0.999)
betas = cosine_beta_schedule(1000)
print(f"Beta schedule: t=0 → {betas[0]:.6f}, t=500 → {betas[500]:.4f}, t=999 → {betas[-1]:.4f}")
# Noise is added gradually — early steps add tiny noise, late steps add lots
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | Transformers (attention in U-Net), VAEs (latent space) |
| Leads to | Video generation (Sora), 3D generation, Audio diffusion |
| Compare with | GANs (adversarial), VAEs (variational), Autoregressive image models |
| Cross-domain | Physics (thermodynamic diffusion), Stochastic processes |
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Prompt-image misalignment | Generated image doesn't match text prompt | Model struggles with spatial relationships and counting | Negative prompts, prompt engineering, ControlNet guidance |
| NSFW content generation | Inappropriate content despite safety filters | Safety classifier misses novel attack vectors | Multi-layer safety classifiers, NSFW model fine-tuning |
| Consistency across generations | Same prompt produces wildly different styles | No seed management, no style conditioning | Fixed seeds for reproducibility, style-conditioned LoRA |
◆ Hands-On Exercises¶
Exercise 1: Compare Diffusion Architectures¶
Goal: Generate images with different diffusion models and compare quality
Time: 30 minutes
Steps:
1. Use the same 5 text prompts across SDXL, DALL-E 3, and Midjourney
2. Rate each output on prompt adherence, quality, and style
3. Measure generation time per image
4. Document cost per generation
Expected Output: Comparison grid with quality scores and generation metrics
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 📄 Paper | Ho et al. "Denoising Diffusion Probabilistic Models" (2020) | Foundational diffusion model paper |
| 📄 Paper | Rombach et al. "Latent Diffusion Models" (2022) | Stable Diffusion architecture paper |
| 🎥 Video | Yannic Kilcher — "Diffusion Models" | Clear explanation of the diffusion process |
| 🔧 Hands-on | HuggingFace Diffusers Library | Production diffusion model library |
★ Sources¶
- Ho et al., "Denoising Diffusion Probabilistic Models" (DDPM, 2020)
- Rombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models" (2022) — Stable Diffusion paper
- "The Illustrated Stable Diffusion" by Jay Alammar
- Stability AI documentation — https://stability.ai