Diffusion Models¶
✨ Bit: Diffusion models literally learn to un-destroy images — start with pure noise, progressively denoise until an image emerges. It's like teaching AI to reverse entropy.
★ TL;DR¶
- What: Generative models that create images by learning to reverse a gradual noise-addition process
- Why: Surpassed GANs in image quality and stability. Power Stable Diffusion, DALL-E, Midjourney
- Key point: Forward process = add noise step by step until pure noise. Reverse process = learn to denoise step by step.
★ Overview¶
Definition¶
Diffusion Models (specifically Denoising Diffusion Probabilistic Models — DDPMs) are generative models that learn to generate data by reversing a gradual noising process. Training teaches the model to denoise slightly noisy images at each step; generation starts from pure noise and iteratively denoises.
Scope¶
Covers diffusion model theory, architecture, and key models. For practical image generation tools, see individual model pages.
Significance¶
- Replaced GANs as the dominant image generation architecture (2022+)
- Power: Stable Diffusion, DALL-E 2/3, Midjourney, Imagen
- Extended to: Video (Sora, Veo), Audio, 3D, and even molecular design
- More stable training than GANs (no mode collapse)
Prerequisites¶
- Transformers — attention blocks (and cross-attention) are used inside the diffusion U-Net
- Basic probability concepts
★ Deep Dive¶
The Core Idea¶
FORWARD PROCESS (Training — add noise):
Clean Image → Slightly Noisy → More Noisy → ... → Pure Gaussian Noise
x₀ x₁ x₂ xₜ
Each step: xₜ = √(αₜ)·xₜ₋₁ + √(1-αₜ)·ε (ε ~ Normal(0,1))
REVERSE PROCESS (Generation — remove noise):
Pure Noise → Slightly Less Noisy → ... → Clean Image!
xₜ xₜ₋₁ x₀
The model learns: "Given noisy image xₜ, predict the noise ε"
Then subtract that predicted noise to get xₜ₋₁
┌──────── FORWARD (destroy) ────────┐
│ │
[Clean Image] → [Noisy] → [Noisier] → ... → [Pure Noise]
[Clean Image] ← [Noisy] ← [Noisier] ← ... ← [Pure Noise]
│ │
└──────── REVERSE (create) ─────────┘
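The forward step above can be written directly in code. A minimal sketch (illustrative α value, not a tuned schedule) showing one noising step, and that if the noise ε were known exactly, the step could be inverted:

```python
import torch

torch.manual_seed(0)
x_prev = torch.randn(3, 8, 8)           # stand-in for x_{t-1} (channels, height, width)
alpha_t = 0.98                          # keep-signal fraction at this step
eps = torch.randn_like(x_prev)          # fresh Gaussian noise

# Forward step: x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * eps
x_t = (alpha_t ** 0.5) * x_prev + ((1 - alpha_t) ** 0.5) * eps

# Reverse direction: with the true eps, the step inverts exactly —
# the model's job is to *predict* eps, since it isn't known at generation time
x_recovered = (x_t - ((1 - alpha_t) ** 0.5) * eps) / (alpha_t ** 0.5)
print(torch.allclose(x_recovered, x_prev, atol=1e-5))  # True
```

In real sampling the model only has a prediction ε_θ, so the reverse step is approximate and repeated over many timesteps.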
Architecture: U-Net + Attention¶
Diffusion Model Architecture:
Noisy Image ──► [U-Net with Attention] ──► Predicted Noise
+
Time Step t ──► (embedded and injected)
+
Text Prompt ──► (cross-attention with text encoder output)
U-Net Structure:
┌───────────────────────────────────┐
│ Encoder │
│ ↓ Downsample + Attention blocks │
├───────────────────────────────────┤
│ Bottleneck (Attention) │
├───────────────────────────────────┤
│ Decoder │
│ ↑ Upsample + Skip connections │
└───────────────────────────────────┘
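The "Time Step t" input above is typically injected as a sinusoidal embedding (the same trick as Transformer positional encodings), then passed through an MLP. A minimal sketch of the embedding, assuming the standard DDPM formulation:

```python
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int = 128) -> torch.Tensor:
    """Sinusoidal embedding of diffusion timesteps (DDPM/Transformer style)."""
    half = dim // 2
    # Geometric frequency ladder from 1 down to 1/10000
    freqs = torch.exp(-math.log(10000) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]                    # (batch, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (batch, dim)

emb = timestep_embedding(torch.tensor([0, 500, 999]))
print(emb.shape)  # torch.Size([3, 128])
```

The resulting vector is usually added (after a small MLP) to the feature maps of each U-Net block, so every layer knows how noisy its input is.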
Text-to-Image Pipeline (Stable Diffusion)¶
"A cat astronaut on Mars"
│
▼
┌────────────┐ ┌───────────────┐ ┌──────────┐
│ Text │ │ Diffusion │ │ VAE │
│ Encoder │ → │ U-Net │ → │ Decoder │ → Image!
│ (CLIP) │ │ (in latent │ │ (latent │
│ │ │ space, not │ │ → pixel)│
└────────────┘ │ pixel space) │ └──────────┘
└───────────────┘
Runs ~20-50 denoising steps
Latent Diffusion (the key innovation in Stable Diffusion): Instead of denoising in pixel space (512×512×3 = huge), work in a compressed latent space (64×64×4 = much smaller). Massively reduces compute.
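A quick back-of-envelope check of that compression, using the dimensions quoted above:

```python
pixel = 512 * 512 * 3    # values per image in pixel space
latent = 64 * 64 * 4     # values per image in SD's latent space
print(pixel, latent, pixel / latent)  # 786432 16384 48.0
```

Every denoising step therefore touches ~48× fewer values, which is what makes running 20-50 steps on consumer GPUs feasible.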
Key Concepts¶
| Concept | Explanation |
|---|---|
| Noise Schedule | How quickly noise is added (linear, cosine, etc.). Affects image quality. |
| Guidance Scale (CFG) | How strongly to follow the text prompt. Higher = more prompt-adherent but less diverse. |
| Sampling Steps | Number of denoising steps. More = better quality, slower. 20-50 is typical. |
| Negative Prompt | What to avoid: "blurry, low quality, distorted" |
| Inpainting | Replace part of an image (mask + prompt for new content) |
| ControlNet | Add spatial control (pose, edges, depth maps) to guide generation |
| LoRA (for images) | Train small adapters to add specific styles/characters to Stable Diffusion |
Major Models¶
| Model | Company | Key Feature |
|---|---|---|
| Stable Diffusion XL/3 | Stability AI | Open-source, customizable, LoRA ecosystem |
| DALL-E 3 | OpenAI | Integrated with ChatGPT, best prompt following |
| Midjourney v6+ | Midjourney | Highest aesthetic quality, artistic |
| Imagen 3 | Google | Google's best, integrated in Gemini |
| Flux | Black Forest Labs | Open, high quality, by ex-Stability AI team |
◆ Formulas & Equations¶
| Name | Formula | Variables | Use |
|---|---|---|---|
| Forward Process | $$q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{\alpha_t}x_{t-1}, (1-\alpha_t)I)$$ | αₜ = noise schedule, I = identity | Add noise at step t |
| Training Objective | $$L = \mathbb{E}_{t,x_0,\epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$ | ε = actual noise, ε_θ = predicted noise | Train the denoiser |
| Classifier-Free Guidance | $$\hat{\epsilon} = \epsilon_{uncond} + s(\epsilon_{cond} - \epsilon_{uncond})$$ | s = guidance scale (typically 7-15) | Control prompt adherence |
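The classifier-free guidance formula maps directly to code: run the denoiser twice per step (with and without the prompt) and extrapolate along the conditional direction. A minimal sketch, using random tensors as stand-ins for the two model outputs:

```python
import torch

def cfg_noise(eps_cond: torch.Tensor, eps_uncond: torch.Tensor, s: float = 7.5) -> torch.Tensor:
    """Classifier-free guidance: amplify the (conditional - unconditional) direction by scale s."""
    return eps_uncond + s * (eps_cond - eps_uncond)

torch.manual_seed(0)
eps_c = torch.randn(4, 64)   # stand-in for eps_theta(x_t, t, prompt)
eps_u = torch.randn(4, 64)   # stand-in for eps_theta(x_t, t, empty prompt)

# Sanity check: s = 1 recovers the plain conditional prediction
assert torch.allclose(cfg_noise(eps_c, eps_u, s=1.0), eps_c)
guided = cfg_noise(eps_c, eps_u, s=7.5)
```

At s > 1 the prediction overshoots past the conditional estimate, which is why large guidance scales increase prompt adherence but reduce diversity.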
◆ Comparison¶
| Aspect | Diffusion Models | GANs | VAEs |
|---|---|---|---|
| Training | Stable (MSE loss) | Unstable (adversarial) | Stable (ELBO) |
| Quality | Excellent | Excellent (when it works) | Blurry |
| Diversity | High (no mode collapse) | Mode collapse risk | High |
| Speed | Slow (many steps) | Fast (one forward pass) | Fast |
| Controllability | High (CFG, ControlNet) | Limited | Limited |
| Status (2026) | Dominant | Declining for images | Niche |
◆ Strengths vs Limitations¶
| ✅ Strengths | ❌ Limitations |
|---|---|
| Best image quality currently | Slow generation (20-50 steps) |
| Stable training (no mode collapse) | High compute for training |
| Highly controllable (CFG, ControlNet, LoRA) | Memory-intensive |
| Works in multiple domains (image, video, audio, 3D) | Theory is mathematically complex |
| Strong open-source ecosystem (Stable Diffusion) | Can reproduce copyrighted styles (legal issues) |
○ Interview Angles¶
- Q: How do diffusion models generate images?
  A: During training, the model learns to predict the noise added to images at various noise levels. During generation, start from pure noise and iteratively denoise over many steps, guided by a text prompt via classifier-free guidance.
- Q: Why did diffusion models replace GANs?
  A: More stable training (no adversarial min-max game), no mode collapse, higher diversity, better controllability (CFG, ControlNet), and image quality that caught up with and then surpassed GANs.
- Q: What is classifier-free guidance?
  A: During training, randomly drop the text condition. At inference, run both conditional and unconditional passes, then amplify the difference between them. The guidance scale controls how strongly the model follows the prompt.
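The "randomly drop the text condition" part of that answer can be sketched as one masking step in the training loop. The shapes and names below are illustrative (CLIP-style 77×768 token embeddings), not from a specific library:

```python
import torch

def drop_condition(text_emb: torch.Tensor, null_emb: torch.Tensor, p_uncond: float = 0.1) -> torch.Tensor:
    """With probability p_uncond per sample, replace the text embedding with the null embedding."""
    mask = (torch.rand(text_emb.shape[0]) < p_uncond)[:, None, None]  # (batch, 1, 1)
    return torch.where(mask, null_emb, text_emb)                      # broadcasts over tokens/dims

text_emb = torch.randn(8, 77, 768)   # batch of prompt embeddings (illustrative shapes)
null_emb = torch.zeros(1, 77, 768)   # embedding of the empty prompt
mixed = drop_condition(text_emb, null_emb)
print(mixed.shape)  # torch.Size([8, 77, 768])
```

Training on this mix is what gives the model the unconditional branch that CFG extrapolates against at inference.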
★ Code & Implementation¶
Image Generation with DALL-E 3 / Stable Diffusion¶
# pip install openai>=1.60 diffusers>=0.27 torch>=2.3
# ⚠️ Last tested: 2026-04 | DALL-E requires: openai>=1.60, OPENAI_API_KEY
# | SD requires: diffusers>=0.27, GPU recommended
# ═══ Method 1: DALL-E 3 via OpenAI API ═══
from openai import OpenAI
client = OpenAI()
response = client.images.generate(
model="dall-e-3",
prompt="A photorealistic transformer robot made of glowing neural connections, dramatic lighting",
size="1024x1024",
quality="standard",
n=1,
)
image_url = response.data[0].url
print(f"DALL-E 3 image URL: {image_url}")
# Note: URL expires after ~1 hour; download immediately
# ═══ Method 2: Stable Diffusion (local, free) ═══
# Requires: CUDA GPU with 6GB+ VRAM
# from diffusers import StableDiffusionPipeline
# import torch
#
# pipe = StableDiffusionPipeline.from_pretrained(
# "runwayml/stable-diffusion-v1-5",
# torch_dtype=torch.float16,
# ).to("cuda")
#
# image = pipe(
# "A photorealistic transformer robot",
# num_inference_steps=20, # quality vs speed tradeoff
# guidance_scale=7.5, # prompt adherence vs diversity
# ).images[0]
# image.save("output.png")
# ═══ Conceptual DDPM Noise Scheduling ═══
import torch
import math
def cosine_beta_schedule(timesteps: int = 1000) -> torch.Tensor:
"""Cosine noise schedule (Improved DDPM, Ho et al. 2022)."""
steps = torch.arange(timesteps + 1, dtype=torch.float64)
s = 0.008 # small offset to prevent singularity at t=0
alphas_bar = torch.cos(((steps / timesteps) + s) / (1 + s) * math.pi * 0.5) ** 2
alphas_bar = alphas_bar / alphas_bar[0]
betas = 1 - (alphas_bar[1:] / alphas_bar[:-1])
return betas.clamp(0, 0.999)
betas = cosine_beta_schedule(1000)
print(f"Beta schedule: t=0 → {betas[0]:.6f}, t=500 → {betas[500]:.4f}, t=999 → {betas[-1]:.4f}")
# Noise is added gradually — early steps add tiny noise, late steps add lots
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | Transformers (attention in U-Net), VAEs (latent space) |
| Leads to | Video generation (Sora), 3D generation, Audio diffusion |
| Compare with | GANs (adversarial), VAEs (variational), Autoregressive image models |
| Cross-domain | Physics (thermodynamic diffusion), Stochastic processes |
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Prompt-image misalignment | Generated image doesn't match text prompt | Model struggles with spatial relationships and counting | Negative prompts, prompt engineering, ControlNet guidance |
| NSFW content generation | Inappropriate content despite safety filters | Safety classifier misses novel attack vectors | Multi-layer safety classifiers, NSFW model fine-tuning |
| Consistency across generations | Same prompt produces wildly different styles | No seed management, no style conditioning | Fixed seeds for reproducibility, style-conditioned LoRA |
◆ Hands-On Exercises¶
Exercise 1: Compare Diffusion Architectures¶
Goal: Generate images with different diffusion models and compare quality
Time: 30 minutes
Steps:
1. Use the same 5 text prompts across SDXL, DALL-E 3, and Midjourney
2. Rate each output on prompt adherence, quality, and style
3. Measure generation time per image
4. Document cost per generation
Expected Output: Comparison grid with quality scores and generation metrics
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 📄 Paper | Ho et al. "Denoising Diffusion Probabilistic Models" (2020) | Foundational diffusion model paper |
| 📄 Paper | Rombach et al. "Latent Diffusion Models" (2022) | Stable Diffusion architecture paper |
| 🎥 Video | Yannic Kilcher — "Diffusion Models" | Clear explanation of the diffusion process |
| 🔧 Hands-on | HuggingFace Diffusers Library | Production diffusion model library |
★ Sources¶
- Ho et al., "Denoising Diffusion Probabilistic Models" (DDPM, 2020)
- Rombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models" (2022) — Stable Diffusion paper
- "The Illustrated Stable Diffusion" by Jay Alammar
- Stability AI documentation — https://stability.ai