Transformers¶
✨ Bit: The paper was titled "Attention Is All You Need" — turns out, attention + ungodly amounts of compute + internet-scale data is what you actually need.
★ TL;DR¶
- What: A neural network architecture based on self-attention that processes entire sequences in parallel
- Why: Replaced RNNs/LSTMs for most sequence tasks. Foundation of virtually all modern LLMs and many other GenAI models
- Key point: Parallelism + attention = trains faster and captures long-range dependencies better than anything before it
★ Overview¶
Definition¶
The Transformer is a deep learning architecture introduced in 2017 by Vaswani et al. It uses a mechanism called self-attention (see Attention Mechanism) to process input sequences in parallel rather than sequentially, making it dramatically faster to train and better at capturing relationships between distant elements in a sequence.
Scope¶
This document covers the Transformer architecture itself. For a deep dive into the attention mechanism, see Attention Mechanism. For specific models built on Transformers, see Large Language Models (LLMs).
Significance¶
- Before Transformers: RNNs/LSTMs processed sequences one step at a time → slow to train, and they struggled to carry information across long sequences
- After Transformers: Parallel processing + attention → scalable to billions of parameters
- Impact: GPT, BERT, T5, LLaMA, Gemini, Claude — ALL are Transformer variants.
Prerequisites¶
- Neural Networks — basic neural network concepts
- Embeddings — vector representations
★ Deep Dive¶
The Original Architecture¶
```text
┌──────────────────────────────────────────────────────┐
│               TRANSFORMER ARCHITECTURE               │
│                                                      │
│  ┌───────────────────┐        ┌───────────────────┐  │
│  │      ENCODER      │        │      DECODER      │  │
│  │   (understands)   │        │    (generates)    │  │
│  │                   │        │                   │  │
│  │ ┌───────────────┐ │        │ ┌───────────────┐ │  │
│  │ │  Multi-Head   │ │        │ │    Masked     │ │  │
│  │ │ Self-Attention│ │        │ │ Self-Attention│ │  │
│  │ └───────┬───────┘ │        │ └───────┬───────┘ │  │
│  │         ↓         │        │         ↓         │  │
│  │ ┌───────────────┐ │        │ ┌───────────────┐ │  │
│  │ │  Add & Norm   │ │        │ │    Cross-     │ │  │
│  │ └───────┬───────┘ │  ┌───→ │ │   Attention   │ │  │
│  │         ↓         │  │     │ │ (to encoder)  │ │  │
│  │ ┌───────────────┐ │  │     │ └───────┬───────┘ │  │
│  │ │ Feed-Forward  │ │  │     │         ↓         │  │
│  │ │    Network    │ │  │     │ ┌───────────────┐ │  │
│  │ └───────┬───────┘ │  │     │ │ Feed-Forward  │ │  │
│  │         ↓         │  │     │ │    Network    │ │  │
│  │ ┌───────────────┐ │  │     │ └───────┬───────┘ │  │
│  │ │  Add & Norm   │ │──┘     │         ↓         │  │
│  │ └───────────────┘ │        │    Output Probs   │  │
│  │                   │        │                   │  │
│  │    × N layers     │        │    × N layers     │  │
│  └───────────────────┘        └───────────────────┘  │
│                                                      │
│  Input: Token Embeddings + Positional Encoding       │
└──────────────────────────────────────────────────────┘
```
Key Components Explained¶
1. Input Embeddings + Positional Encoding¶
Tokens (words/subwords) are converted to dense vectors. Since Transformers process all tokens in parallel (no sequential order), positional encoding is added to inject position information.
The original paper uses sinusoidal encoding. Modern models often use learned positional embeddings or RoPE (Rotary Position Embeddings).
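The sinusoidal scheme can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name `sinusoidal_positional_encoding` is an assumption made for this example, and real models compute the same table as a tensor and add it to the token embeddings.

```python
import math

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> list[list[float]]:
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even so sine and cosine can be paired")
    table = []
    for pos in range(seq_len):
        row = [0.0] * d_model
        for i in range(d_model // 2):
            # Each dimension pair (2i, 2i+1) shares one wavelength.
            angle = pos / (10000 ** (2 * i / d_model))
            row[2 * i] = math.sin(angle)      # even dimensions use sine
            row[2 * i + 1] = math.cos(angle)  # odd dimensions use cosine
        table.append(row)
    return table

pe = sinusoidal_positional_encoding(seq_len=4, d_model=8)
# Position 0 alternates 0 and 1 because sin(0) = 0 and cos(0) = 1.
print(pe[0])
```

Every position gets a distinct pattern of values, and the encoding needs no training because it is a fixed function of the position.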
2. Self-Attention (The Core Innovation)¶
Each token looks at every token in the sequence (itself included) to decide what's important. See Attention Mechanism for the full deep dive.
Simplified intuition: For the sentence "The cat sat on the mat because it was tired" — self-attention lets "it" attend strongly to "cat" to understand the reference.
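To make the intuition concrete, here is a small pure-Python sketch of scaled dot-product self-attention. The 2-dimensional token vectors are made up for illustration, and the learned query/key/value projections of a real model are omitted (each token vector is used directly as its own query, key, and value), so treat this as a toy rather than a faithful model.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d_k)) V for lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical embeddings: "it" is deliberately placed close to "cat".
names = ["cat", "mat", "it"]
x = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
outputs, weights = attention(x, x, x)

for name, w in zip(names, weights[2]):
    print(f"'it' -> {name}: {w:.2f}")
```

Because the vector for "it" points in nearly the same direction as the one for "cat", the row of attention weights for "it" puts more mass on "cat" than on "mat".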
3. Multi-Head Attention¶
Instead of one attention computation, run multiple in parallel (multiple "heads"). Each head can learn different relationship types:
- Head 1 might learn syntactic relationships
- Head 2 might learn semantic relationships
- Head 3 might learn positional relationships
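The split-attend-concatenate pattern behind multi-head attention can be sketched as follows. This is a simplified illustration in plain Python: the helper names `split_heads` and `multi_head_self_attention` are assumptions for this example, and the learned per-head projections and the final output projection of a real model are left out.

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention; returns one output vector per query."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        outputs.append([
            sum((e / total) * v[j] for e, v in zip(exps, values))
            for j in range(len(values[0]))
        ])
    return outputs

def split_heads(x, n_heads):
    """Slice every d_model-sized vector into n_heads chunks of d_model // n_heads."""
    d_head = len(x[0]) // n_heads
    return [[vec[h * d_head:(h + 1) * d_head] for vec in x] for h in range(n_heads)]

def multi_head_self_attention(x, n_heads):
    # Each head attends independently over its own slice of the features.
    per_head = [attention(h, h, h) for h in split_heads(x, n_heads)]
    # Concatenate the head outputs back into one d_model-sized vector per token.
    return [[value for head in per_head for value in head[t]]
            for t in range(len(x))]

x = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]
out = multi_head_self_attention(x, n_heads=2)
print(len(out), len(out[0]))  # same shape as the input: 3 tokens, 4 features
```

Since each head sees a different slice, the heads can produce different attention patterns over the same tokens, which is what lets them specialize.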
4. Feed-Forward Network (FFN)¶
After attention, each position passes through the same 2-layer network independently: $$FFN(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2$$
This is where the model is thought to store much of its "knowledge", that is, factual information learned during training. One common interpretation is that the FFN acts as a key-value memory.
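A position-wise FFN with ReLU can be written out by hand as below. The weights are small made-up numbers (d_model = 2, d_ff = 4) chosen only so the arithmetic is easy to follow; they are not taken from any trained model.

```python
def linear(x, W, b):
    """Compute x @ W + b for a vector x and a (d_in x d_out) matrix W."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def relu(v):
    return [max(0.0, a) for a in v]

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = ReLU(x W1 + b1) W2 + b2, applied to one position."""
    return linear(relu(linear(x, W1, b1)), W2, b2)

# Hypothetical weights: expand 2 -> 4, then project back 4 -> 2.
W1 = [[1.0, -1.0, 0.5, 0.0],
      [0.0, 1.0, -0.5, 2.0]]
b1 = [0.0, 0.0, 0.0, -1.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0],
      [0.5, -0.5]]
b2 = [0.1, -0.1]

sequence = [[1.0, 2.0], [3.0, -1.0]]
# The same weights are applied to every position, with no mixing between positions.
outputs = [ffn(token, W1, b1, W2, b2) for token in sequence]
print(outputs)
```

Reordering the input positions simply reorders the outputs, which is the sense in which the FFN works on each position independently; all cross-token mixing happens in the attention sub-layer.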
5. Residual Connections + Layer Norm¶
Every sub-layer is wrapped in a residual connection (skip connection) followed by layer normalization: $$\text{LayerNorm}(x + \text{Sublayer}(x))$$
This mitigates vanishing gradients and enables training very deep networks (100+ layers).
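The two common ways of combining the residual connection with normalization can be sketched like this. It is a minimal illustration in plain Python: `post_norm` and `pre_norm` are names chosen for this example, and the learned gain and bias of a real LayerNorm are omitted.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_norm(x, sublayer):
    """Original Transformer ordering: LayerNorm(x + Sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm(x, sublayer):
    """Ordering used by many later models: x + Sublayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [1.0, 2.0, 3.0, 4.0]
half = lambda v: [0.5 * a for a in v]    # stand-in for attention or the FFN
silent = lambda v: [0.0 for _ in v]      # a sub-layer that contributes nothing

y_post = post_norm(x, half)
y_pre = pre_norm(x, silent)
print(y_post)
print(y_pre)
```

With a sub-layer that outputs zeros, the pre-norm block returns its input unchanged. That untouched identity path is what lets gradients flow straight through many stacked layers.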
Encoder vs Decoder vs Both¶
| Variant | Architecture | Models | Use Case |
|---|---|---|---|
| Encoder-only | Just the encoder stack | BERT, RoBERTa | Understanding: classification, NER, embeddings |
| Decoder-only | Just the decoder stack | GPT, LLaMA, Claude | Generation: text completion, chat |
| Encoder-Decoder | Both stacks | T5, BART, original Transformer | Seq2seq: translation, summarization |
Modern trend: Decoder-only dominates for GenAI because generation IS the task.
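The main mechanical difference between an encoder stack and a decoder stack is the causal mask in self-attention. The sketch below is a plain-Python illustration (the function name `causal_attention_weights` is chosen for this example): future positions are set to negative infinity before the softmax so they receive zero weight.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over an n x n score matrix with future positions (j > i) masked."""
    n = len(scores)
    weights = []
    for i in range(n):
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        peak = max(masked)
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Uniform raw scores, so any differences come from the mask alone.
scores = [[0.0] * 4 for _ in range(4)]
for row in causal_attention_weights(scores):
    print([round(w, 2) for w in row])
```

Token i can only spread its attention over positions 0 through i, which matches left-to-right generation. An encoder skips the mask, so every token sees the whole sequence.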
Modern Improvements Over Original¶
| Improvement | What Changed | Used In |
|---|---|---|
| RoPE | Rotary position embeddings (better than sinusoidal) | LLaMA, Qwen, Mistral |
| GQA | Grouped Query Attention (efficiency) | LLaMA 2+, Gemini |
| MoE | Mixture of Experts (sparse activation) | LLaMA 4, Mixtral, GPT-4 (rumored) |
| SwiGLU | Better activation function in FFN | LLaMA, PaLM |
| RMSNorm | Simpler normalization (rescales by root mean square, no mean-centering), typically applied pre-norm | LLaMA, Gemma |
| Flash Attention | Memory-efficient attention computation | Nearly all modern models |
| KV Cache | Cache key/value for faster inference | All autoregressive models |
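As one example from the table, the effect of a KV cache can be shown with a toy decoding loop. This is a sketch under stated assumptions: `project_kv` is a hypothetical stand-in for the learned key/value projections, and the call counter exists only to make the saved work visible.

```python
import math

def attend(q, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum((e / total) * v[j] for e, v in zip(exps, values))
            for j in range(len(values[0]))]

calls = {"kv_projections": 0}

def project_kv(x):
    """Hypothetical stand-in for the learned projections x @ W_K and x @ W_V."""
    calls["kv_projections"] += 1
    return [2.0 * a for a in x], [a + 1.0 for a in x]

tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.2, -0.3]]

# Without a cache: every step re-projects the whole prefix.
no_cache_out = []
for t in range(len(tokens)):
    kv = [project_kv(x) for x in tokens[: t + 1]]
    keys, values = [k for k, _ in kv], [v for _, v in kv]
    no_cache_out.append(attend(tokens[t], keys, values))
no_cache_calls = calls["kv_projections"]

# With a cache: project only the new token and append to the stored keys/values.
calls["kv_projections"] = 0
cached_keys, cached_values, cache_out = [], [], []
for x in tokens:
    k, v = project_kv(x)
    cached_keys.append(k)
    cached_values.append(v)
    cache_out.append(attend(x, cached_keys, cached_values))
cache_calls = calls["kv_projections"]

print(no_cache_calls, cache_calls)
```

Both loops produce the same outputs, but the cached version does one projection per generated token instead of re-projecting the entire prefix, at the cost of keeping all past keys and values in memory.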
◆ Terminology¶
| Term | Meaning |
|---|---|
| Token | Smallest unit of text the model processes (word piece, ~4 chars in English) |
| Embedding | Dense vector representation of a token |
| Attention Score | How much one token should "pay attention to" another |
| Head | One parallel attention computation |
| Layer | One complete block (attention + FFN + norms) |
| Context Window | Maximum number of tokens the model can process at once |
| KV Cache | Stored key-value pairs from previous tokens to speed up generation |
| MoE | Mixture of Experts — only activates a subset of parameters per token |
◆ Formulas & Equations¶
| Name | Formula | Variables | Use |
|---|---|---|---|
| Attention | $$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$ | Q=queries, K=keys, V=values, d_k=key dimension | Core attention computation |
| Positional Encoding | $$PE_{(pos,2i)} = \sin(pos/10000^{2i/d}), \quad PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d})$$ | pos=position, i=dimension-pair index, d=model dimension | Inject position info |
| FFN | $$FFN(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2$$ | W₁, W₂=weight matrices | Process each position independently |
◆ Strengths vs Limitations¶
| ✅ Strengths | ❌ Limitations |
|---|---|
| Parallelizable (unlike RNNs) → fast training | Quadratic memory/compute with sequence length (O(n²)) |
| Captures long-range dependencies via attention | Fixed context window (though growing: 1M-10M tokens) |
| Scales predictably with more data/compute | Massive compute requirements for training |
| Transfer learning works incredibly well | Positional encoding schemes still imperfect |
| Architecture is simple and modular | No inherent understanding of time/causality |
◆ Quick Reference¶
Transformer Block:
Input → [Multi-Head Attention] → Add & Norm → [FFN] → Add & Norm → Output
Key Dimensions (GPT-3 175B example):
- Layers: 96
- Heads: 96
- d_model: 12288
- d_ff: 49152 (4x d_model)
- Context: 2048 tokens
Modern Scaling (LLaMA 4 family):
- Behemoth parameters: ~2T total (MoE, so ~288B active per token)
- Context: 10M tokens (Scout variant)
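As a sanity check on the GPT-3 dimensions listed above, the parameter count can be roughly reconstructed from them. This is a back-of-the-envelope sketch: it ignores biases, layer-norm parameters, and positional embeddings, and the vocabulary size of 50257 is an assumption not stated in this document (it is the GPT-2/GPT-3 BPE vocabulary size).

```python
n_layers = 96
d_model = 12288
d_ff = 49152          # 4 x d_model
vocab_size = 50257    # assumption: GPT-2/GPT-3 BPE vocabulary

attn_params = 4 * d_model * d_model   # W_Q, W_K, W_V, W_O
ffn_params = 2 * d_model * d_ff       # W_1 and W_2
per_layer = attn_params + ffn_params  # works out to 12 * d_model^2

embedding_params = vocab_size * d_model
total = n_layers * per_layer + embedding_params

print(f"per layer: {per_layer / 1e9:.2f}B")
print(f"total:     {total / 1e9:.1f}B")
```

The estimate lands close to the 175B headline figure, and it shows that the FFN weights account for about two thirds of each layer.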
○ Interview Angles¶
- Q: Why do Transformers use scaled dot-product attention (divide by √d_k)?

  A: Without scaling, dot products grow large with high dimensions, pushing softmax into regions with tiny gradients. Dividing by √d_k keeps gradients healthy.

- Q: What's the computational complexity of self-attention?

  A: O(n²·d) where n is sequence length and d is dimension. This quadratic scaling with n is the main bottleneck for long sequences.

- Q: Why decoder-only for generation instead of encoder-decoder?

  A: Simpler architecture, easier to scale, and with enough data the decoder learns to "encode" implicitly. Also, causal masking naturally fits left-to-right generation.
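The √d_k answer can be checked numerically. The sketch below assumes query and key components drawn from a standard normal distribution (a common simplifying assumption, not a property of any particular trained model) and uses a fixed seed so the run is repeatable.

```python
import math
import random

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_variance(d_k, n_samples=2000, seed=0):
    """Empirical variance of q . k when all components are N(0, 1) draws."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / n_samples
    return sum((x - mean) ** 2 for x in dots) / n_samples

variances = {d_k: dot_product_variance(d_k) for d_k in (4, 64)}
for d_k, var in variances.items():
    print(f"d_k={d_k}: variance of q.k is about {var:.1f}")

# A score that is large only because d_k = 64 saturates the softmax;
# dividing it by sqrt(64) = 8 leaves a much softer distribution.
unscaled = softmax([16.0, 0.0, 0.0])
scaled = softmax([16.0 / math.sqrt(64), 0.0, 0.0])
print(f"unscaled top weight: {unscaled[0]:.6f}")
print(f"scaled top weight:   {scaled[0]:.3f}")
```

The variance of the raw dot product grows roughly in proportion to d_k, so dividing by √d_k brings the scores back to about unit variance and keeps the softmax away from its flat, near-one-hot regime.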
★ Code & Implementation¶
Load and Run a Transformer-Based LLM (HuggingFace)¶
```python
# pip install "transformers>=4.40" "torch>=2.3" accelerate
# ⚠️ Last tested: 2026-04 | Requires: transformers>=4.40, torch>=2.3
# device_map="auto" (needs accelerate) uses a GPU if one is available, else CPU.
# CPU mode: runs slowly but works for learning.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-2b-it"  # ~5GB download; swap for any instruct model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",            # auto-distributes to available GPU/CPU
    attn_implementation="eager",  # use "flash_attention_2" on GPU with CUDA
)

prompt = [{"role": "user", "content": "Explain the transformer architecture in 3 sentences."}]
# return_dict=True returns input_ids together with the attention_mask
inputs = tokenizer.apply_chat_template(
    prompt, return_tensors="pt", add_generation_prompt=True, return_dict=True
).to(model.device)

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, skipping the prompt
prompt_len = inputs["input_ids"].shape[1]
response = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
print(response)
```
Minimal Transformer Block in PyTorch¶
```python
# ⚠️ Last tested: 2026-04 | Requires: torch>=2.3
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    """Single transformer decoder block: MHA + FFN + residuals."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        # Causal mask: True marks future positions that must not be attended to
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, device=x.device), diagonal=1
        ).bool()
        # Pre-norm (modern style: norm before attention, not after)
        h = self.norm1(x)  # normalize once, reuse as query, key, and value
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out                 # residual connection
        x = x + self.ffn(self.norm2(x))  # FFN + residual
        return x


# Test:
block = TransformerBlock(d_model=64, n_heads=4, d_ff=256)
dummy = torch.randn(2, 10, 64)  # batch=2, seq_len=10, d_model=64
out = block(dummy)
print(f"Input shape: {dummy.shape} → Output shape: {out.shape}")  # Should match
```
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | Neural Networks, Embeddings, Attention Mechanism |
| Leads to | Large Language Models (LLMs), Diffusion Models |
| Compare with | RNNs (sequential), LSTMs (gated sequential), CNNs (local patterns) |
| Cross-domain | Graph attention networks (GNNs), Vision Transformers (ViT) |
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Attention bottleneck | Inference latency grows quadratically with sequence length | O(n²) self-attention complexity | FlashAttention, sparse attention, SSM alternatives |
| Positional encoding limits | Quality degrades beyond training context length | Fixed positional encodings don't extrapolate | RoPE with NTK scaling, ALiBi, position interpolation |
| KV-cache memory explosion | OOM during batch inference with long sequences | KV-cache grows linearly per layer per head per token | GQA/MQA, KV-cache quantization, paged attention (vLLM) |
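The KV-cache row is easy to quantify. The sketch below applies the standard size formula to the GPT-3-like shape from the Quick Reference (96 layers, 96 heads, d_model 12288, so head_dim 128) at 2 bytes per value. The 8-KV-head GQA variant is a hypothetical configuration used only for comparison.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """2 (keys and values) x layers x KV heads x head_dim x tokens x batch x bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

GIB = 2 ** 30

# Full multi-head attention: every one of the 96 heads stores its own K and V.
mha = kv_cache_bytes(n_layers=96, n_kv_heads=96, head_dim=128, seq_len=2048, batch_size=1)

# Hypothetical GQA variant: 96 query heads share 8 KV heads.
gqa = kv_cache_bytes(n_layers=96, n_kv_heads=8, head_dim=128, seq_len=2048, batch_size=1)

print(f"MHA: {mha / GIB:.2f} GiB per sequence")
print(f"GQA: {gqa / GIB:.2f} GiB per sequence")
print(f"MHA at batch 32: {32 * mha / GIB:.0f} GiB")
```

The cache grows linearly in both sequence length and batch size, so long-context batched serving runs out of memory quickly, and cutting the number of KV heads shrinks the cache by the same factor.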
◆ Hands-On Exercises¶
Exercise 1: Implement Scaled Dot-Product Attention from Scratch¶
Goal: Build attention in pure PyTorch and verify it against PyTorch's built-in implementation
Time: 30 minutes
Steps:
1. Implement Q·K^T/√d_k → softmax → ·V in PyTorch
2. Add causal mask
3. Compare output against torch.nn.functional.scaled_dot_product_attention
4. Verify outputs match to 1e-5 tolerance
Expected Output: Matching outputs and attention weight visualization
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 📄 Paper | Vaswani et al. "Attention Is All You Need" (2017) | The foundational transformer paper — read Sections 3-4 |
| 🎥 Video | 3Blue1Brown — "Attention in Transformers" | Best visual explanation of how attention works |
| 🎓 Course | Stanford CS224n: NLP with Deep Learning | Gold standard NLP course covering transformers in depth |
| 📘 Book | "Build a Large Language Model (From Scratch)" by Sebastian Raschka (2024), Ch 3 | Step-by-step transformer implementation in PyTorch |
★ Sources¶
- Vaswani et al., "Attention Is All You Need" (2017) — https://arxiv.org/abs/1706.03762
- "The Illustrated Transformer" by Jay Alammar — https://jalammar.github.io/illustrated-transformer/
- Andrej Karpathy, "Let's build GPT from scratch" — YouTube lecture
- "Formal Algorithms for Transformers" (Phuong & Hutter, 2022)