Neural Networks¶

✨ Bit: A neural network is just layers of math (multiply, add, apply function, repeat). We call them "neurons" because marketing, not because they actually work like brains.

★ TL;DR¶

What: Computational models made of layers of interconnected nodes that learn patterns from data by adjusting weights
Why: THE foundation of deep learning and GenAI. Every LLM, every diffusion model, every AI agent runs on neural networks underneath.
Key point: Input → multiply by weights → add bias → apply activation function → output. Stack layers of this = deep learning.

★ Overview¶

Definition¶

A neural network is a function approximator made of layers of interconnected nodes (neurons). Each connection has a learnable weight. By adjusting these weights through training (backpropagation), the network learns to map inputs to desired outputs.

Scope¶

Covers the building blocks needed to understand GenAI architectures. For the specific architecture powering LLMs, see Transformers. For training details, see Deep Learning Fundamentals.

Significance¶

Every GenAI model is a neural network
Understanding neurons → layers → architectures is required to understand Transformers, diffusion, etc.
Activation functions, backpropagation, and gradient flow are concepts you'll see EVERYWHERE

Prerequisites¶

Linear Algebra For Ai — matrix multiplication is the core operation
Basic Python For Ai — for code examples

★ Deep Dive¶

The Neuron (Simplest Unit)¶

A single neuron:

  Inputs       Weights
  x₁ ──── w₁ ──┐
  x₂ ──── w₂ ──┼──► Σ(wᵢxᵢ + b) ──► activation(z) ──► output
  x₃ ──── w₃ ──┘     ↑                    ↑
                    bias (b)         e.g., ReLU, Sigmoid

MATH:
  z = w₁x₁ + w₂x₂ + w₃x₃ + b    (weighted sum + bias)
  output = activation(z)            (apply non-linearity)

In matrix form:
  z = W·x + b          ← This is why linear algebra matters!
  output = σ(z)         ← σ = activation function

Layers¶

INPUT LAYER        HIDDEN LAYERS           OUTPUT LAYER
(your data)        (learned features)      (prediction)

  x₁ ─────┐    ┌──── h₁ ────┐    ┌──── h₅ ────┐
           ├────┤            ├────┤            ├──── ŷ₁
  x₂ ─────┤    ├──── h₂ ────┤    ├──── h₆ ────┤
           ├────┤            ├────┤            ├──── ŷ₂
  x₃ ─────┤    ├──── h₃ ────┤    └──── h₇ ────┘
           ├────┤            │
  x₄ ─────┘    └──── h₄ ────┘

  Layer 0         Layer 1          Layer 2        Layer 3
  (4 neurons)     (4 neurons)      (3 neurons)    (2 neurons)

  "Deep" = More than 1 hidden layer
  GPT-5 has ~100+ layers with billions of neurons

Activation Functions (Why Non-Linearity Matters)¶

WITHOUT activation: Each layer is just W·x + b
  Layer 1: y = W₁x + b₁
  Layer 2: y = W₂(W₁x + b₁) + b₂ = W₂W₁x + W₂b₁ + b₂

  This collapses to: y = W'x + b'  ← Still just a LINEAR function!
  No matter how many layers, it's just one big linear transform.
  Linear functions can only draw straight lines. Useless for complex data.

WITH activation: Non-linearity lets the network learn ANY function.

Function	Formula	Graph Shape	When to Use
ReLU	max(0, x)	`___/`	Default for hidden layers (fast, simple)
GELU	x · Φ(x)	Smooth `___/`	Transformers (GPT, BERT) — smoother than ReLU
Sigmoid	1/(1+e⁻ˣ)	S-curve [0,1]	Output layer for binary classification
Tanh	(eˣ-e⁻ˣ)/(eˣ+e⁻ˣ)	S-curve [-1,1]	When you need centered outputs
Softmax	eˣⁱ/Σeˣʲ	Probabilities	Output layer for multi-class (next-token prediction!)
SiLU/Swish	x · sigmoid(x)	Smooth `___/`	Modern architectures (LLaMA, Mistral)

For GenAI: GELU is used in GPT/BERT. SiLU/Swish is used in LLaMA/Mistral. Softmax is the output for every LLM (probability over vocabulary).

Backpropagation (How Networks Learn)¶

THE TRAINING LOOP:

  ┌─── 1. FORWARD PASS ──────────────────────────────┐
  │    Push input through network → get prediction     │
  │    x → Layer1 → Layer2 → ... → ŷ (prediction)     │
  └────────────────────────────────────────────────────┘
                         │
                         ▼
  ┌─── 2. COMPUTE LOSS ──────────────────────────────┐
  │    How wrong is the prediction?                    │
  │    Loss = L(ŷ, y)  e.g., cross-entropy, MSE      │
  └────────────────────────────────────────────────────┘
                         │
                         ▼
  ┌─── 3. BACKWARD PASS (Backpropagation) ───────────┐
  │    Calculate: ∂Loss/∂w for EVERY weight            │
  │    Uses the chain rule of calculus:                 │
  │    ∂L/∂w₁ = ∂L/∂ŷ · ∂ŷ/∂h₂ · ∂h₂/∂h₁ · ∂h₁/∂w₁│
  │    "How much did each weight contribute to error?"  │
  └────────────────────────────────────────────────────┘
                         │
                         ▼
  ┌─── 4. UPDATE WEIGHTS ────────────────────────────┐
  │    w_new = w_old - learning_rate × gradient        │
  │    Move each weight a tiny step to reduce loss     │
  └────────────────────────────────────────────────────┘
                         │
                         ▼
              Repeat for thousands of iterations
              until loss is small enough

The chain rule is the mathematical core. You don't need to compute it manually — PyTorch's autograd does it automatically. But understanding the concept is crucial.

Network Types (Building Blocks for GenAI)¶

Type	Architecture	What It's Good At	GenAI Relevance
Feed-Forward (FFN)	Input → Hidden → Output	Simple classification/regression	Used INSIDE Transformers (the MLP block)
CNN	Convolutional filters + pooling	Images, spatial patterns	Vision encoders (ViT combines CNN ideas + attention)
RNN/LSTM	Sequential processing, hidden state	Sequential data (text, time series)	Replaced by Transformers (RNNs can't parallelize)
Transformer	Self-attention + FFN	Everything (text, image, video)	THE architecture of modern GenAI

Evolution:  FFN (1980s) → CNN (1998) → RNN/LSTM (2014) → Transformer (2017)
                                                           ↑
                                                    This won. Everything else
                                                    is supporting architecture.

◆ Code & Implementation¶

# ⚠️ Last tested: 2026-04
# ═══ SIMPLE NEURAL NETWORK IN PYTORCH ═══
import torch
import torch.nn as nn

# Define a 3-layer neural network
class SimpleNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.layer1 = nn.Linear(input_size, hidden_size)   # Input → Hidden
        self.layer2 = nn.Linear(hidden_size, hidden_size)  # Hidden → Hidden
        self.layer3 = nn.Linear(hidden_size, output_size)  # Hidden → Output
        self.relu = nn.ReLU()                               # Activation

    def forward(self, x):
        x = self.relu(self.layer1(x))   # Layer 1 + ReLU
        x = self.relu(self.layer2(x))   # Layer 2 + ReLU
        x = self.layer3(x)              # Output (no activation — loss handles it)
        return x

# Create model
model = SimpleNN(input_size=784, hidden_size=256, output_size=10)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
# → Parameters: 268,810  (tiny! GPT-5 has ~1 trillion)

# Training step
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()

# One training step:
input_data = torch.randn(32, 784)     # Batch of 32, 784 features
labels = torch.randint(0, 10, (32,))  # 32 labels (0-9)

prediction = model(input_data)         # 1. Forward pass
loss = loss_fn(prediction, labels)     # 2. Compute loss
loss.backward()                        # 3. Backpropagation (autograd!)
optimizer.step()                       # 4. Update weights
optimizer.zero_grad()                  # Reset gradients for next step

◆ Quick Reference¶

BUILDING BLOCKS:
  Neuron = weights × inputs + bias → activation
  Layer  = many neurons processing in parallel
  Network = stack of layers
  Deep   = more than 1 hidden layer

KEY NUMBERS:
  Simple NN:    ~100K-1M parameters
  CNN (ResNet): ~25M parameters
  BERT:         ~110M-340M parameters
  GPT-4:        ~1.8T parameters (estimated)
  GPT-5:        ~1T+ parameters

ACTIVATION CHOICE:
  Hidden layers → ReLU (default), GELU (Transformers), SiLU (LLaMA)
  Binary output → Sigmoid
  Multi-class  → Softmax
  Regression   → None (linear output)

○ Gotchas & Common Mistakes¶

⚠️ "Neural networks work like brains": No. They're matrix multiplications with non-linear functions. The analogy is marketing.
⚠️ Vanishing gradients: In very deep networks, gradients can shrink to near-zero. ReLU and residual connections fix this.
⚠️ More layers ≠ always better: Without proper techniques (residuals, normalization), deeper networks can actually perform worse.
⚠️ Overfitting: Network memorizes training data instead of learning patterns. Use dropout, regularization, more data.

○ Interview Angles¶

Q: What does backpropagation actually compute?
A: The gradient of the loss function with respect to every weight in the network, using the chain rule of calculus. These gradients tell us how to adjust each weight to reduce the error.
Q: Why do we need activation functions?
A: Without non-linear activation, any number of layers collapses to a single linear transformation (y = Wx + b). Non-linearity lets the network approximate any function, not just lines/planes.

★ Connections¶

Relationship	Topics
Builds on	Linear Algebra For Ai, Probability And Statistics
Leads to	Transformers, Deep Learning Fundamentals
Compare with	Decision trees, SVMs (simpler ML models)
Cross-domain	Neuroscience (loose inspiration), Control theory

★ Recommended Resources¶

Type	Resource	Why
🎥 Video	3Blue1Brown — "What is a Neural Network?"	Best visual introduction to neural networks
🎓 Course	Stanford CS231n	Deep dive into neural network architectures
📘 Book	"Deep Learning" by Goodfellow, Bengio, Courville (2016), Ch 6	Mathematical foundations of feedforward networks

★ Sources¶

3Blue1Brown "Neural Networks" series — https://youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
Google "Neural Networks" course — https://developers.google.com/machine-learning/crash-course
Ian Goodfellow, "Deep Learning" textbook (2016) — Chapter 6
PyTorch Tutorials — https://pytorch.org/tutorials/