
Deep Learning Fundamentals

Bit: Training a neural network is like adjusting millions of knobs simultaneously. Each knob moves only a tiny amount per step, but repeat that billions of times and the model learns to write poetry, code, and diagnose diseases.


★ TL;DR

  • What: The training pipeline, optimization techniques, and practical skills for training/fine-tuning deep learning models
  • Why: Knowing architecture (Neural Networks) tells you WHAT a model is. This tells you HOW it learns.
  • Key point: The training loop (forward → loss → backward → update) is the same whether you're training a 10-parameter model or GPT-5. Only scale differs.

★ Overview

Definition

Deep learning fundamentals covers the practical machinery of training neural networks — the training loop, optimizers, regularization, learning rate scheduling, and hardware considerations that turn an untrained model into a useful one.

Scope

Covers training mechanics applicable to all GenAI models. For Transformer-specific architecture, see Transformers. For LLM-specific fine-tuning (LoRA, QLoRA), see Fine Tuning.

Prerequisites


★ Deep Dive

The Training Loop (Universal)

# ⚠️ Last tested: 2026-04
# THE training loop — this is the same for BERT, GPT, Stable Diffusion, everything.
for epoch in range(num_epochs):
    for batch in dataloader:
        # 1. FORWARD PASS: push data through model
        predictions = model(batch.inputs)

        # 2. COMPUTE LOSS: how wrong is the model?
        loss = loss_function(predictions, batch.targets)

        # 3. BACKWARD PASS: compute gradients (backpropagation)
        loss.backward()

        # 4. UPDATE WEIGHTS: adjust model parameters
        optimizer.step()

        # 5. RESET: clear gradients for next iteration
        optimizer.zero_grad()

        # 6. (Optional) SCHEDULE: adjust learning rate
        scheduler.step()
VISUALLY:

  Data ──► [Model] ──► Prediction ──► Loss ──┐
              ↑                                │
              │         ∂Loss/∂w ◄─── Backprop ◄┘
              │              │
              └──── Update ──┘
              w = w - lr × gradient

  Repeat billions of times = trained model

Optimizers (How to Update Weights)

BASIC GRADIENT DESCENT:
  w_new = w_old - learning_rate × gradient

  Problem: Same learning rate for all parameters.
           Noisy updates. Gets stuck in local minima.
| Optimizer | How It Improves | Used In | Status |
|---|---|---|---|
| SGD | Random mini-batches → faster iterations | Classic ML | Still used with momentum |
| SGD + Momentum | Accumulates past gradient direction | CNNs | Image models |
| Adam | Adaptive LR per-parameter + momentum | General | Most popular default |
| AdamW | Adam + decoupled ("proper") weight decay | Transformers | Standard for LLMs |
| Adafactor | Memory-efficient Adam variant | Large models | When memory-constrained |
| Lion | Simple sign-based updates | Emerging | Research; sometimes beats Adam |

For GenAI: AdamW is the standard. Almost every LLM/Transformer uses AdamW.
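A minimal sketch of how that default looks in PyTorch — the model is a stand-in, and the betas/weight-decay values are common LLM choices, not universal:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for a real network

# AdamW with typical Transformer-style hyperparameters
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.95),   # running estimates of gradient mean / variance
    weight_decay=0.01,   # decoupled weight decay — the "W" in AdamW
)

# Contrast: classic SGD with momentum, still common for CNNs
sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
```

The key difference from plain Adam is that `weight_decay` here shrinks weights directly at update time instead of being folded into the gradient.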

Learning Rate (The Most Important Hyperparameter)

TOO HIGH:
  Loss bounces around, never converges, or explodes
  ████████  ████████
  ████         ████    ← Unstable!
      ████████

TOO LOW:
  Loss decreases painfully slowly
  ████████████████████████████  ← Eventually converges but takes forever
  ████████████████████████

JUST RIGHT:
  ████████████
  ████████
  ████████         ← Smooth convergence
  ████████

Learning Rate Schedules:

| Schedule | How It Works | Use |
|---|---|---|
| Constant | lr stays the same | Simple experiments |
| Linear Warmup + Decay | Increase LR from 0 → peak, then decrease | Standard for LLMs |
| Cosine Annealing | LR follows a cosine curve: high → low (warm restarts jump back high) | Longer training |
| OneCycleLR | Warmup → peak → decay in one cycle | Efficient training |
TYPICAL LLM LEARNING RATE SCHEDULE:

  LR   │     ╱──────╲
       │    ╱         ╲
       │   ╱            ╲
       │  ╱               ╲
       │ ╱                  ╲
       │╱                     ╲
  ─────┼───┬──────────────────┬──────
       │ warmup    training    decay
       │ (~2000     (main      (cool
       │  steps)    phase)     down)
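The warmup-then-cosine shape above can be sketched with a plain `LambdaLR`; the step counts and peak LR are illustrative, not prescriptive:

```python
import math
import torch

model = torch.nn.Linear(4, 2)  # stand-in model
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)  # 1e-4 = peak LR

warmup_steps, total_steps = 100, 1000

def lr_lambda(step):
    # Linear warmup 0 → peak, then cosine decay peak → 0
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)

lrs = []
for _ in range(total_steps):
    opt.step()                    # normally preceded by forward/backward
    scheduler.step()
    lrs.append(scheduler.get_last_lr()[0])
```

Plotting `lrs` reproduces the ramp/plateau/decay picture: it peaks at 1e-4 right after warmup and decays smoothly to zero.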

Regularization (Preventing Overfitting)

OVERFITTING:
  "The model memorized the training data instead of learning patterns."

  Training accuracy: 99%   ← Looks great!
  Test accuracy: 60%       ← Actually terrible.

  The model is like a student who memorized test answers
  but doesn't understand the subject.
| Technique | How It Works | Where Used |
|---|---|---|
| Dropout | Randomly disable neurons during training (e.g., 10%) | Transformer attention, FFN |
| Weight Decay | Penalize large weights (L2: Loss + λ·‖w‖²; decoupled in AdamW) | AdamW (built-in) |
| Batch Normalization | Normalize layer inputs to mean=0, std=1 | CNNs (less in Transformers) |
| Layer Normalization | Normalize across features per sample | Transformers (standard) |
| Data Augmentation | Create variations of training data | Image models |
| Early Stopping | Stop when validation loss starts increasing | All models |

For Transformers: Layer Normalization + Dropout + Weight Decay (via AdamW) is the standard combo.
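A minimal sketch of where that combo lives in code — a hypothetical pre-LayerNorm feed-forward block (the dimensions and dropout rate are illustrative):

```python
import torch
import torch.nn as nn

class FFNBlock(nn.Module):
    """Pre-LayerNorm feed-forward block with the standard regularization trio."""
    def __init__(self, d_model=64, d_ff=256, p_drop=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)   # per-sample feature normalization
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(),
            nn.Dropout(p_drop),             # dropout inside the FFN
            nn.Linear(d_ff, d_model),
        )
    def forward(self, x):
        return x + self.ff(self.norm(x))    # residual connection

block = FFNBlock()
# The third piece — weight decay — comes from the optimizer, not the module:
opt = torch.optim.AdamW(block.parameters(), lr=1e-4, weight_decay=0.01)

block.eval()                                # disables dropout for inference
out = block(torch.randn(8, 64))
```

Note that dropout and layer norm live in the model, while weight decay is an optimizer setting — which is why `model.eval()` affects the first two but not the third.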

GPU/CUDA Basics

WHY GPUs FOR AI?

  CPU: 8-128 cores → Great at complex sequential tasks
  GPU: 10,000+ cores → Great at simple parallel tasks

  Neural network = millions of identical multiply-add operations
  = PERFECT for GPUs

NVIDIA GPU HIERARCHY (2025-2026):
  Consumer:     RTX 4090 (24GB) → Fine for inference, small training
  Professional: A100 (40/80GB) → The workhorse of AI training
  Latest:       H100 (80GB) → 2-3x faster than A100
  Newest:       B200 (Blackwell, 192GB) → Next generation

VRAM IS THE BOTTLENECK:
  Model must fit in GPU memory (VRAM)
  LLaMA 7B in FP16 = ~14 GB → Fits on 1× RTX 4090
  LLaMA 70B in FP16 = ~140 GB → Need 2× A100 80GB
  LLaMA 70B in INT4 = ~35 GB → Fits on 1× A100 40GB (still too big for a 24GB RTX 4090)
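The estimates above are just parameter count × bytes per parameter; a rough helper (weights only — activations, KV cache, and optimizer state add more on top):

```python
def model_bytes(n_params: float, bits_per_param: int) -> float:
    """Rough VRAM needed just for the weights.

    Ignores activations, KV cache, and optimizer state, which can
    multiply the total several times over during training.
    """
    return n_params * bits_per_param / 8

gb = 1e9
print(f"7B  @ FP16: {model_bytes(7e9, 16) / gb:.0f} GB")   # ~14 GB
print(f"70B @ FP16: {model_bytes(70e9, 16) / gb:.0f} GB")  # ~140 GB
print(f"70B @ INT4: {model_bytes(70e9, 4) / gb:.0f} GB")   # ~35 GB
```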
# ⚠️ Last tested: 2026-04
# Check GPU
import torch
print(torch.cuda.is_available())          # True if GPU ready
print(torch.cuda.get_device_name(0))      # e.g., "NVIDIA GeForce RTX 4090"
print(f"{torch.cuda.mem_get_info()[0]/1e9:.1f} GB free")  # Available VRAM

# Mixed precision training (FP16 compute where safe, FP32 for stability)
from torch.amp import autocast, GradScaler   # torch.cuda.amp is deprecated
scaler = GradScaler("cuda")

with autocast("cuda"):               # Use FP16 where safe
    output = model(input)
    loss = loss_fn(output, target)

scaler.scale(loss).backward()        # Scale loss to prevent gradient underflow
scaler.step(optimizer)               # Unscales gradients, then steps
scaler.update()

Common Training Problems & Fixes

| Problem | Symptom | Fix |
|---|---|---|
| Loss not decreasing | Loss stays flat or increases | Lower LR, check data, check loss function |
| Loss explodes (NaN) | Loss = inf or NaN | Lower LR, gradient clipping, check data |
| Overfitting | Train loss ↓, val loss ↑ | More data, dropout, weight decay, early stopping |
| Underfitting | Both losses stay high | Bigger model, more training, higher LR |
| OOM (Out of Memory) | CUDA out of memory error | Smaller batch size, gradient accumulation, quantization |
| Slow training | Each step takes too long | Mixed precision, compiled model, better data loading |
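Gradient clipping, the standard fix for exploding loss, is one line placed between `backward()` and `step()`. A sketch with a stand-in model (`max_norm=1.0` is a common default, not a rule):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # stand-in model
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

x, y = torch.randn(16, 10), torch.randn(16, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Rescale gradients so their global L2 norm is at most 1.0
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

opt.step()
opt.zero_grad()
```

`clip_grad_norm_` returns the pre-clipping norm, which is worth logging: a sudden spike there often precedes a NaN loss by a few steps.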

◆ Quick Reference

TRAINING RECIPE (LLM fine-tuning):
  Optimizer: AdamW
  LR: 1e-4 to 2e-5  (lower for bigger models)
  Schedule: Linear warmup (5-10% of steps) + cosine decay
  Batch size: As large as VRAM allows (use gradient accumulation)
  Epochs: 1-3 (for fine-tuning; pre-training = 1 pass over data)
  Precision: BF16 or FP16 (mixed precision)
  Regularization: Dropout 0.1 + weight decay 0.01

MEMORY-SAVING TRICKS:
  1. Gradient accumulation (simulate large batches)
  2. Mixed precision (FP16/BF16)
  3. Gradient checkpointing (recompute instead of store)
  4. LoRA/QLoRA (train only small adapters)
  5. DeepSpeed / FSDP (distribute across GPUs)
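Trick #1 above in sketch form — accumulate gradients over several micro-batches, then do one optimizer step (the data and sizes are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # stand-in model
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

accum_steps = 4  # effective batch size = micro_batch × accum_steps
data = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(8)]

opt.zero_grad()
for step, (x, y) in enumerate(data):
    loss = nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()  # scale so gradients average over the big batch
    if (step + 1) % accum_steps == 0:
        opt.step()                   # one update per accum_steps micro-batches
        opt.zero_grad()
```

Dividing the loss by `accum_steps` is what makes the summed gradients match a single large-batch average; forgetting it effectively multiplies the learning rate.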

METRICS TO MONITOR:
  - Training loss (should decrease)
  - Validation loss (should decrease, not diverge from train)
  - Learning rate (check schedule is working)
  - GPU utilization (should be >90%)
  - Memory usage (stay under limit)

○ Gotchas & Common Mistakes

  • ⚠️ Learning rate too high: The #1 cause of training failure. Start lower than you think.
  • ⚠️ No warmup: Starting with high LR can destabilize early training. Always use warmup for Transformers.
  • ⚠️ Forgetting model.eval(): For inference, always set model.eval() — it disables dropout and changes batch norm behavior.
  • ⚠️ Not monitoring validation loss: You won't catch overfitting without a separate validation set.
  • ⚠️ Gradient accumulation math: If accumulating over N steps, effective batch size = micro_batch × N. Scale LR accordingly.

○ Interview Angles

  • Q: What optimizer do you use for training Transformers and why?
  • A: AdamW. It's Adam with decoupled weight decay, which provides better regularization for Transformers. Adam adapts the learning rate per-parameter using running estimates of gradient mean and variance.

  • Q: How would you handle GPU memory limitations when training?

  • A: (1) Reduce batch size + gradient accumulation, (2) Mixed precision (BF16), (3) Gradient checkpointing, (4) LoRA/QLoRA (train small adapters not full model), (5) DeepSpeed ZeRO / FSDP (distribute across GPUs).

★ Code & Implementation

Backpropagation from Scratch + PyTorch Comparison

# pip install torch>=2.3
# ⚠️ Last tested: 2026-04 | Requires: torch>=2.3
import torch
import torch.nn as nn
import torch.nn.functional as F

# ═══ Manual 2-layer MLP forward + backward ═══
torch.manual_seed(42)
X = torch.randn(32, 10)      # 32 samples, 10 features
y = torch.randint(0, 3, (32,))  # 3-class labels

# Weights
W1 = torch.randn(10, 64, requires_grad=True)
b1 = torch.zeros(64,      requires_grad=True)
W2 = torch.randn(64, 3,  requires_grad=True)
b2 = torch.zeros(3,       requires_grad=True)

lr = 1e-3
for epoch in range(5):
    # Forward
    h   = F.relu(X @ W1 + b1)    # (32, 64)
    out = h @ W2 + b2             # (32, 3)
    loss = F.cross_entropy(out, y)

    # Backward
    loss.backward()

    # SGD update
    with torch.no_grad():
        W1 -= lr * W1.grad; W1.grad.zero_()
        b1 -= lr * b1.grad; b1.grad.zero_()
        W2 -= lr * W2.grad; W2.grad.zero_()
        b2 -= lr * b2.grad; b2.grad.zero_()

    print(f"Epoch {epoch+1}: loss={loss.item():.4f}")

# ═══ Same with nn.Module ═══ (idiomatic PyTorch)
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(10, 64), nn.ReLU(),
            nn.Linear(64, 3),
        )
    def forward(self, x): return self.net(x)

model     = MLP()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(5):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(X), y)
    loss.backward()
    optimizer.step()
print(f"nn.Module final loss: {loss.item():.4f}")

★ Connections

| Relationship | Topics |
|---|---|
| Builds on | Neural Networks, Linear Algebra For Ai, Probability And Statistics |
| Leads to | Transformers, Fine Tuning, Inference Optimization |
| Compare with | Classical ML training (scikit-learn — much simpler) |
| Cross-domain | Optimization theory, Numerical methods, Systems engineering |

◆ Production Failure Modes

| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Vanishing/exploding gradients | Training loss plateaus or diverges | Deep networks with poor initialization or no normalization | Layer normalization, residual connections, careful init |
| Overfitting on small datasets | Training accuracy 99% but test accuracy 60% | Insufficient regularization; model too large for data | Dropout, weight decay, data augmentation, early stopping |
| Learning rate pathology | Training never converges or oscillates | LR too high/low, no schedule | LR finder, cosine annealing, warmup + decay |

◆ Hands-On Exercises

Exercise 1: Diagnose Training Pathologies

Goal: Deliberately cause and then fix common training failures
Time: 30 minutes
Steps:
  1. Train a small neural network on MNIST
  2. Remove batch normalization — observe gradient issues
  3. Set learning rate to 1.0 — observe divergence
  4. Fix each issue and plot the corrected training curves
Expected Output: Before/after training curves for each pathology


★ Resources

| Type | Resource | Why |
|---|---|---|
| 📘 Book | "Deep Learning" by Goodfellow, Bengio, Courville (2016) | The definitive deep learning textbook |
| 🎓 Course | fast.ai — Practical Deep Learning | Best practical introduction to deep learning |
| 🎥 Video | 3Blue1Brown — "Neural Networks" | Beautiful visual explanations of DL concepts |

★ Sources

  • Karpathy, "Let's Build GPT from Scratch" — https://youtube.com/watch?v=kCc8FmEb1nY
  • Ian Goodfellow, "Deep Learning" Chapters 7-8 (Regularization, Optimization)
  • PyTorch Training Tutorial — https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html
  • Loshchilov & Hutter, "Decoupled Weight Decay Regularization" (AdamW, 2017)