Deep Learning Fundamentals¶
✨ Bit: Training a neural network is like adjusting millions of knobs simultaneously. Each knob only has to move a tiny bit, but do it billions of times and somehow the model learns to write poetry, code, and diagnose diseases.
★ TL;DR¶
- What: The training pipeline, optimization techniques, and practical skills for training/fine-tuning deep learning models
- Why: Knowing architecture (Neural Networks) tells you WHAT a model is. This tells you HOW it learns.
- Key point: The training loop (forward → loss → backward → update) is the same whether you're training a 10-parameter model or GPT-5. Only scale differs.
★ Overview¶
Definition¶
Deep learning fundamentals covers the practical machinery of training neural networks — the training loop, optimizers, regularization, learning rate scheduling, and hardware considerations that turn an untrained model into a useful one.
Scope¶
Covers training mechanics applicable to all GenAI models. For Transformer-specific architecture, see Transformers. For LLM-specific fine-tuning (LoRA, QLoRA), see Fine Tuning.
Prerequisites¶
- Neural Networks — architecture basics
- Linear Algebra For Ai — matrix operations
- Probability And Statistics — loss functions
★ Deep Dive¶
The Training Loop (Universal)¶
# ⚠️ Last tested: 2026-04
# THE training loop — this is the same for BERT, GPT, Stable Diffusion, everything.
for epoch in range(num_epochs):
for batch in dataloader:
# 1. FORWARD PASS: push data through model
predictions = model(batch.inputs)
# 2. COMPUTE LOSS: how wrong is the model?
loss = loss_function(predictions, batch.targets)
# 3. BACKWARD PASS: compute gradients (backpropagation)
loss.backward()
# 4. UPDATE WEIGHTS: adjust model parameters
optimizer.step()
# 5. RESET: clear gradients for next iteration
optimizer.zero_grad()
# 6. (Optional) SCHEDULE: adjust learning rate
scheduler.step()
VISUALLY:
Data ──► [Model] ──► Prediction ──► Loss ──┐
↑ │
│ ∂Loss/∂w ◄─── Backprop ◄┘
│ │
└──── Update ──┘
w = w - lr × gradient
Repeat billions of times = trained model
Optimizers (How to Update Weights)¶
BASIC GRADIENT DESCENT:
w_new = w_old - learning_rate × gradient
Problem: Same learning rate for all parameters.
Noisy updates. Gets stuck in local minima.
| Optimizer | How It Improves | Used In | Status |
|---|---|---|---|
| SGD | Random mini-batches → faster iterations | Classic ML | Still used with momentum |
| SGD + Momentum | Accumulates past gradient direction | CNNs | Image models |
| Adam | Adaptive LR per-parameter + momentum | General | Most popular default |
| AdamW | Adam + proper weight decay regularization | Transformers | Standard for LLMs |
| Adafactor | Memory-efficient Adam variant | Large models | When memory-constrained |
| LION | Simple sign-based updates | Emerging | Research, sometimes beats Adam |
For GenAI: AdamW is the standard. Almost every LLM/Transformer uses AdamW.
Learning Rate (The Most Important Hyperparameter)¶
TOO HIGH:
Loss bounces around, never converges, or explodes
████████ ████████
████ ████ ← Unstable!
████████
TOO LOW:
Loss decreases painfully slowly
████████████████████████████ ← Eventually converges but takes forever
████████████████████████
JUST RIGHT:
████████████
████████
████████ ← Smooth convergence
████████
Learning Rate Schedules:
| Schedule | How It Works | Use |
|---|---|---|
| Constant | lr stays the same | Simple experiments |
| Linear Warmup + Decay | Increase LR from 0 → peak, then decrease | Standard for LLMs |
| Cosine Annealing | LR follows cosine curve: high → low → high | Longer training |
| OneCycleLR | Warmup → peak → decay in one cycle | Efficient training |
TYPICAL LLM LEARNING RATE SCHEDULE:
LR │ ╱──────╲
│ ╱ ╲
│ ╱ ╲
│ ╱ ╲
│ ╱ ╲
│╱ ╲
─────┼───┬──────────────────┬──────
│ warmup training decay
│ (~2000 (main (cool
│ steps) phase) down)
Regularization (Preventing Overfitting)¶
OVERFITTING:
"The model memorized the training data instead of learning patterns."
Training accuracy: 99% ← Looks great!
Test accuracy: 60% ← Actually terrible.
The model is like a student who memorized test answers
but doesn't understand the subject.
| Technique | How It Works | Where Used |
|---|---|---|
| Dropout | Randomly disable neurons during training (e.g., 10%) | Transformer attention, FFN |
| Weight Decay | Add penalty for large weights: Loss + λ·‖w‖² | AdamW (built-in) |
| Batch Normalization | Normalize layer inputs to mean=0, std=1 | CNNs (less in Transformers) |
| Layer Normalization | Normalize across features per sample | Transformers (standard) |
| Data Augmentation | Create variations of training data | Image models |
| Early Stopping | Stop when validation loss starts increasing | All models |
For Transformers: Layer Normalization + Dropout + Weight Decay (via AdamW) is the standard combo.
GPU/CUDA Basics¶
WHY GPUs FOR AI?
CPU: 8-128 cores → Great at complex sequential tasks
GPU: 10,000+ cores → Great at simple parallel tasks
Neural network = millions of identical multiply-add operations
= PERFECT for GPUs
NVIDIA GPU HIERARCHY (2025-2026):
Consumer: RTX 4090 (24GB) → Fine for inference, small training
Professional: A100 (40/80GB) → The workhorse of AI training
Latest: H100 (80GB) → 2-3x faster than A100
Newest: B200 (Blackwell, 192GB) → Next generation
VRAM IS THE BOTTLENECK:
Model must fit in GPU memory (VRAM)
LLaMA 7B in FP16 = ~14 GB → Fits on 1× RTX 4090
LLaMA 70B in FP16 = ~140 GB → Need 2× A100 80GB
LLaMA 70B in INT4 = ~35 GB → Fits on 1× A100 40GB or RTX 4090!
# ⚠️ Last tested: 2026-04
# Check GPU
import torch
print(torch.cuda.is_available()) # True if GPU ready
print(torch.cuda.get_device_name(0)) # e.g., "NVIDIA GeForce RTX 4090"
print(f"{torch.cuda.mem_get_info()[0]/1e9:.1f} GB free") # Available VRAM
# Mixed precision training (use FP16 for speed, FP32 for stability)
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
with autocast(): # Use FP16 where safe
output = model(input)
loss = loss_fn(output, target)
scaler.scale(loss).backward() # Scale loss to prevent underflow
scaler.step(optimizer) # Unscale and step
scaler.update()
Common Training Problems & Fixes¶
| Problem | Symptom | Fix |
|---|---|---|
| Loss not decreasing | Loss stays flat or increases | Lower LR, check data, check loss function |
| Loss explodes (NaN) | Loss = inf or NaN | Lower LR, gradient clipping, check data |
| Overfitting | Train loss ↓, val loss ↑ | More data, dropout, weight decay, early stopping |
| Underfitting | Both losses stay high | Bigger model, more training, higher LR |
| OOM (Out of Memory) | CUDA out of memory error | Smaller batch size, gradient accumulation, quantization |
| Slow training | Each step takes too long | Mixed precision, compiled model, better data loading |
◆ Quick Reference¶
TRAINING RECIPE (LLM fine-tuning):
Optimizer: AdamW
LR: 1e-4 to 2e-5 (lower for bigger models)
Schedule: Linear warmup (5-10% of steps) + cosine decay
Batch size: As large as VRAM allows (use gradient accumulation)
Epochs: 1-3 (for fine-tuning; pre-training = 1 pass over data)
Precision: BF16 or FP16 (mixed precision)
Regularization: Dropout 0.1 + weight decay 0.01
MEMORY-SAVING TRICKS:
1. Gradient accumulation (simulate large batches)
2. Mixed precision (FP16/BF16)
3. Gradient checkpointing (recompute instead of store)
4. LoRA/QLoRA (train only small adapters)
5. DeepSpeed / FSDP (distribute across GPUs)
METRICS TO MONITOR:
- Training loss (should decrease)
- Validation loss (should decrease, not diverge from train)
- Learning rate (check schedule is working)
- GPU utilization (should be >90%)
- Memory usage (stay under limit)
○ Gotchas & Common Mistakes¶
- ⚠️ Learning rate too high: The #1 cause of training failure. Start lower than you think.
- ⚠️ No warmup: Starting with high LR can destabilize early training. Always use warmup for Transformers.
- ⚠️ Forgetting
model.eval(): For inference, always setmodel.eval()— it disables dropout and changes batch norm behavior. - ⚠️ Not monitoring validation loss: You won't catch overfitting without a separate validation set.
- ⚠️ Gradient accumulation math: If accumulating over N steps, effective batch size = micro_batch × N. Scale LR accordingly.
○ Interview Angles¶
- Q: What optimizer do you use for training Transformers and why?
-
A: AdamW. It's Adam with decoupled weight decay, which provides better regularization for Transformers. Adam adapts the learning rate per-parameter using running estimates of gradient mean and variance.
-
Q: How would you handle GPU memory limitations when training?
- A: (1) Reduce batch size + gradient accumulation, (2) Mixed precision (BF16), (3) Gradient checkpointing, (4) LoRA/QLoRA (train small adapters not full model), (5) DeepSpeed ZeRO / FSDP (distribute across GPUs).
★ Code & Implementation¶
Backpropagation from Scratch + PyTorch Comparison¶
# pip install torch>=2.3
# ⚠️ Last tested: 2026-04 | Requires: torch>=2.3
import torch
import torch.nn as nn
import torch.nn.functional as F
# ═══ Manual 2-layer MLP forward + backward ═══
torch.manual_seed(42)
X = torch.randn(32, 10) # 32 samples, 10 features
y = torch.randint(0, 3, (32,)) # 3-class labels
# Weights
W1 = torch.randn(10, 64, requires_grad=True)
b1 = torch.zeros(64, requires_grad=True)
W2 = torch.randn(64, 3, requires_grad=True)
b2 = torch.zeros(3, requires_grad=True)
lr = 1e-3
for epoch in range(5):
# Forward
h = F.relu(X @ W1 + b1) # (32, 64)
out = h @ W2 + b2 # (32, 3)
loss = F.cross_entropy(out, y)
# Backward
loss.backward()
# SGD update
with torch.no_grad():
W1 -= lr * W1.grad; W1.grad.zero_()
b1 -= lr * b1.grad; b1.grad.zero_()
W2 -= lr * W2.grad; W2.grad.zero_()
b2 -= lr * b2.grad; b2.grad.zero_()
print(f"Epoch {epoch+1}: loss={loss.item():.4f}")
# ═══ Same with nn.Module ═══ (idiomatic PyTorch)
class MLP(nn.Module):
def __init__(self):
super().__init__()
self.net = nn.Sequential(
nn.Linear(10, 64), nn.ReLU(),
nn.Linear(64, 3),
)
def forward(self, x): return self.net(x)
model = MLP()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(5):
optimizer.zero_grad()
loss = F.cross_entropy(model(X), y)
loss.backward()
optimizer.step()
print(f"nn.Module final loss: {loss.item():.4f}")
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | Neural Networks, Linear Algebra For Ai, Probability And Statistics |
| Leads to | Transformers, Fine Tuning, Inference Optimization |
| Compare with | Classical ML training (scikit-learn — much simpler) |
| Cross-domain | Optimization theory, Numerical methods, Systems engineering |
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Vanishing/exploding gradients | Training loss plateaus or diverges | Deep networks with poor initialization or no normalization | Layer normalization, residual connections, careful init |
| Overfitting on small datasets | Training accuracy 99% but test accuracy 60% | Insufficient regularization, model too large for data | Dropout, weight decay, data augmentation, early stopping |
| Learning rate pathology | Training never converges or oscillates | LR too high/low, no schedule | LR finder, cosine annealing, warmup + decay |
◆ Hands-On Exercises¶
Exercise 1: Diagnose Training Pathologies¶
Goal: Deliberately cause and then fix common training failures Time: 30 minutes Steps: 1. Train a small neural network on MNIST 2. Remove batch normalization — observe gradient issues 3. Set learning rate to 1.0 — observe divergence 4. Fix each issue and plot the corrected training curves Expected Output: Before/after training curves for each pathology
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 📘 Book | "Deep Learning" by Goodfellow, Bengio, Courville (2016) | The definitive deep learning textbook |
| 🎓 Course | fast.ai — Practical Deep Learning | Best practical introduction to deep learning |
| 🎥 Video | 3Blue1Brown — "Neural Networks" | Beautiful visual explanations of DL concepts |
★ Sources¶
- Karpathy, "Let's Build GPT from Scratch" — https://youtube.com/watch?v=kCc8FmEb1nY
- Ian Goodfellow, "Deep Learning" Chapters 7-8 (Regularization, Optimization)
- PyTorch Training Tutorial — https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html
- Loshchilov & Hutter, "Decoupled Weight Decay Regularization" (AdamW, 2017)