
Transformers

Bit: The paper was titled "Attention Is All You Need" — turns out, attention + ungodly amounts of compute + internet-scale data is what you actually need.


★ TL;DR

  • What: A neural network architecture based on self-attention that processes entire sequences in parallel
  • Why: Replaced RNNs/LSTMs. Foundation of ALL modern LLMs and most GenAI models
  • Key point: Parallelism + attention = trains faster and captures long-range dependencies better than anything before it

★ Overview

Definition

The Transformer is a deep learning architecture introduced in 2017 by Vaswani et al. It uses a mechanism called self-attention (see Attention Mechanism) to process input sequences in parallel rather than sequentially, making it dramatically faster to train and better at capturing relationships between distant elements in a sequence.

Scope

This document covers the Transformer architecture itself. For a deep dive into the attention mechanism, see Attention Mechanism. For specific models built on Transformers, see Large Language Models (LLMs).

Significance

  • Before Transformers: RNNs/LSTMs processed sequences one step at a time → slow to train, struggled with long sequences
  • After Transformers: Parallel processing + attention → scalable to billions of parameters
  • Impact: GPT, BERT, T5, LLaMA, Gemini, Claude — ALL are Transformer variants.

Prerequisites


★ Deep Dive

The Original Architecture

┌─────────────────────────────────────────────────────────────┐
│                    TRANSFORMER ARCHITECTURE                  │
│                                                              │
│  ┌──────────────────┐          ┌──────────────────┐         │
│  │     ENCODER       │          │     DECODER       │        │
│  │  (understands)    │          │   (generates)     │        │
│  │                   │          │                   │        │
│  │ ┌───────────────┐ │    ┌──→ │ ┌───────────────┐ │        │
│  │ │ Multi-Head    │ │    │    │ │ Masked        │ │        │
│  │ │ Self-Attention│ │    │    │ │ Self-Attention│ │        │
│  │ └───────┬───────┘ │    │    │ └───────┬───────┘ │        │
│  │         ↓         │    │    │         ↓         │        │
│  │ ┌───────────────┐ │    │    │ ┌───────────────┐ │        │
│  │ │ Add & Norm    │ │    │    │ │ Cross-        │ │        │
│  │ └───────┬───────┘ │    │    │ │ Attention     │ │        │
│  │         ↓         │    │    │ │ (to encoder)  │ │        │
│  │ ┌───────────────┐ │    │    │ └───────┬───────┘ │        │
│  │ │ Feed-Forward  │ │    │    │         ↓         │        │
│  │ │ Network       │ │────┘    │ ┌───────────────┐ │        │
│  │ └───────┬───────┘ │         │ │ Feed-Forward  │ │        │
│  │         ↓         │         │ │ Network       │ │        │
│  │ ┌───────────────┐ │         │ └───────┬───────┘ │        │
│  │ │ Add & Norm    │ │         │         ↓         │        │
│  │ └───────────────┘ │         │    Output Probs   │        │
│  │                   │         │                   │        │
│  │   × N layers      │         │   × N layers      │        │
│  └──────────────────┘          └──────────────────┘         │
│                                                              │
│  Input: Token Embeddings + Positional Encoding               │
└─────────────────────────────────────────────────────────────┘

Key Components Explained

1. Input Embeddings + Positional Encoding

Tokens (words/subwords) are converted to dense vectors. Since Transformers process all tokens in parallel (no sequential order), positional encoding is added to inject position information.

Input = Token_Embedding(x) + Positional_Encoding(position)

The original paper uses sinusoidal encoding. Modern models often use learned positional embeddings or RoPE (Rotary Position Embeddings).
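
A minimal sketch of the sinusoidal scheme (the function name and toy sizes are illustrative; real models precompute this table once and add it to the embeddings):

# Sinusoidal positional encoding: even dims get sin, odd dims get cos, at geometric frequencies.
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    positions = torch.arange(seq_len).unsqueeze(1)        # (seq_len, 1)
    dims = torch.arange(0, d_model, 2)                    # even dimension indices 0, 2, 4, ...
    freqs = positions / (10000 ** (dims / d_model))       # (seq_len, d_model / 2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(freqs)
    pe[:, 1::2] = torch.cos(freqs)
    return pe

print(sinusoidal_positional_encoding(seq_len=10, d_model=64).shape)   # torch.Size([10, 64])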

2. Self-Attention (The Core Innovation)

Each token looks at ALL other tokens to decide what's important. See Attention Mechanism for the full deep dive.

Simplified intuition: For the sentence "The cat sat on the mat because it was tired" — self-attention lets "it" attend strongly to "cat" to understand the reference.
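
A minimal single-head sketch of that computation in PyTorch (random tensors; real models derive Q, K, V from learned projections of the token embeddings):

# Scaled dot-product attention for one head, on toy tensors.
import torch
import torch.nn.functional as F

q = torch.randn(1, 10, 64)                        # (batch, seq_len, d_k) queries
k = torch.randn(1, 10, 64)                        # keys
v = torch.randn(1, 10, 64)                        # values

scores = q @ k.transpose(-2, -1) / 64 ** 0.5      # (1, 10, 10): every token scored against every other
weights = F.softmax(scores, dim=-1)               # each row sums to 1: "how much to attend"
output = weights @ v                              # (1, 10, 64): weighted mix of value vectors
print(output.shape)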

3. Multi-Head Attention

Instead of one attention computation, run multiple in parallel (multiple "heads"). Each head can learn a different type of relationship:

  • Head 1 might learn syntactic relationships
  • Head 2 might learn semantic relationships
  • Head 3 might learn positional relationships
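
Mechanically, the heads come from splitting d_model into n_heads chunks, attending within each chunk, and concatenating the results (a shape-only sketch that skips the learned Q/K/V projections):

# Multi-head attention as reshaping: d_model -> (n_heads, d_head), attend per head, merge back.
import torch
import torch.nn.functional as F

batch, seq_len, d_model, n_heads = 2, 10, 512, 8
d_head = d_model // n_heads                       # 64 dims per head

x = torch.randn(batch, seq_len, d_model)
# In a real block, q/k/v come from separate linear projections of x.
q = k = v = x.view(batch, seq_len, n_heads, d_head).transpose(1, 2)   # (batch, heads, seq, d_head)

out = F.scaled_dot_product_attention(q, k, v)     # attention runs independently in each head
out = out.transpose(1, 2).reshape(batch, seq_len, d_model)            # concatenate heads back
print(out.shape)  # torch.Size([2, 10, 512])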

4. Feed-Forward Network (FFN)

After attention, each position passes through the same 2-layer network independently:

FFN(x) = ReLU(x·W₁ + b₁)·W₂ + b₂

This is where the model stores "knowledge" — factual information learned during training. The FFN acts as a key-value memory.
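
As a module, the FFN is just two linear layers with a nonlinearity, applied to every position with the same weights (a minimal sketch; sizes follow the original paper):

# Position-wise feed-forward network: the same two layers applied at every position.
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048            # original paper sizes (d_ff = 4 x d_model)
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),                       # modern models often swap in GELU or SwiGLU
    nn.Linear(d_ff, d_model),
)

x = torch.randn(2, 10, d_model)      # (batch, seq_len, d_model)
print(ffn(x).shape)                  # unchanged shape: each position processed independently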

5. Residual Connections + Layer Norm

Every sub-layer has a residual connection (skip connection) and layer normalization:

output = LayerNorm(x + SubLayer(x))

This prevents vanishing gradients and enables training very deep networks (100+ layers).
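
In code, the difference between the original post-norm ordering and the pre-norm ordering most modern models use is just where the normalization sits relative to the residual (a sketch; sublayer stands in for attention or the FFN):

# Post-norm (original paper) vs pre-norm (most modern LLMs); sublayer = attention or FFN.
def post_norm_step(x, sublayer, norm):
    return norm(x + sublayer(x))     # normalize after adding the residual

def pre_norm_step(x, sublayer, norm):
    return x + sublayer(norm(x))     # normalize the input; residual path stays untouched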

Encoder vs Decoder vs Both

| Variant | Architecture | Models | Use case |
|---|---|---|---|
| Encoder-only | Just the encoder stack | BERT, RoBERTa | Understanding: classification, NER, embeddings |
| Decoder-only | Just the decoder stack | GPT, LLaMA, Claude | Generation: text completion, chat |
| Encoder-decoder | Both stacks | T5, BART, original Transformer | Seq2seq: translation, summarization |

Modern trend: Decoder-only dominates for GenAI because generation IS the task.

Modern Improvements Over Original

| Improvement | What changed | Used in |
|---|---|---|
| RoPE | Rotary position embeddings (better than sinusoidal) | LLaMA, Qwen, Mistral |
| GQA | Grouped Query Attention (efficiency) | LLaMA 2+, Gemini |
| MoE | Mixture of Experts (sparse activation) | LLaMA 4, Mixtral, GPT-4 (rumored) |
| SwiGLU | Better activation function in the FFN | LLaMA, PaLM |
| RMSNorm | Simpler normalization (applied pre-norm) | LLaMA, Gemma |
| FlashAttention | Memory-efficient attention computation | Nearly all modern models |
| KV cache | Cache keys/values for faster inference | All autoregressive models |
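
To make the KV cache entry concrete, here is a toy decode loop that computes keys and values only for the newest token and appends them to a cache instead of re-encoding the whole prefix (projection matrices and shapes are illustrative stand-ins):

# Toy KV cache: each decode step adds one new K/V pair instead of recomputing the prefix.
import torch
import torch.nn.functional as F

d_model = 64
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))   # stand-ins for learned projections

k_cache, v_cache = [], []
for step in range(5):                                # pretend we decode 5 tokens one at a time
    new_token = torch.randn(1, d_model)              # hidden state of the newest token only
    q = new_token @ w_q
    k_cache.append(new_token @ w_k)                  # append, don't recompute earlier tokens
    v_cache.append(new_token @ w_v)
    K, V = torch.cat(k_cache), torch.cat(v_cache)    # (step + 1, d_model)
    attn_out = F.softmax(q @ K.T / d_model**0.5, dim=-1) @ V   # attend over everything cached so far

print(attn_out.shape)  # torch.Size([1, 64])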

◆ Terminology

| Term | Meaning |
|---|---|
| Token | Smallest unit of text the model processes (word piece, ~4 chars in English) |
| Embedding | Dense vector representation of a token |
| Attention score | How much one token should "pay attention to" another |
| Head | One parallel attention computation |
| Layer | One complete block (attention + FFN + norms) |
| Context window | Maximum number of tokens the model can process at once |
| KV cache | Stored key/value pairs from previous tokens to speed up generation |
| MoE | Mixture of Experts — only activates a subset of parameters per token |

◆ Formulas & Equations

| Name | Formula | Variables | Use |
|---|---|---|---|
| Attention | $\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$ | $Q$=queries, $K$=keys, $V$=values, $d_k$=key dimension | Core attention computation |
| Positional encoding | $PE_{(pos,2i)} = \sin\left(pos/10000^{2i/d}\right)$, $PE_{(pos,2i+1)} = \cos\left(pos/10000^{2i/d}\right)$ | $pos$=position, $i$=dimension index, $d$=model dimension | Inject position info |
| FFN | $\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2$ | $W_1$, $W_2$=weight matrices; $b_1$, $b_2$=biases | Process each position independently |

◆ Strengths vs Limitations

| ✅ Strengths | ❌ Limitations |
|---|---|
| Parallelizable (unlike RNNs) → fast training | Quadratic memory/compute with sequence length (O(n²)) |
| Captures long-range dependencies via attention | Fixed context window (though growing: 1M-10M tokens) |
| Scales predictably with more data/compute | Massive compute requirements for training |
| Transfer learning works incredibly well | Positional encoding schemes still imperfect |
| Architecture is simple and modular | No inherent understanding of time/causality |

◆ Quick Reference

Transformer Block:
  Input → [Multi-Head Attention] → Add & Norm → [FFN] → Add & Norm → Output

Key Dimensions (GPT-3 175B example):
  - Layers: 96
  - Heads: 96
  - d_model: 12288
  - d_ff: 49152 (4x d_model)
  - Context: 2048 tokens

Modern Scaling (LLaMA 4 Behemoth):
  - Parameters: 2T+ (but MoE, so ~288B active)
  - Context: 10M tokens (Scout variant)

○ Interview Angles

  • Q: Why do Transformers use scaled dot-product attention (divide by √d_k)?
  • A: Without scaling, dot products grow large with high dimensions, pushing softmax into regions with tiny gradients. Dividing by √d_k keeps gradients healthy (see the numeric check after this list).

  • Q: What's the computational complexity of self-attention?

  • A: O(n²·d) where n is sequence length and d is dimension. This quadratic scaling with n is the main bottleneck for long sequences.

  • Q: Why decoder-only for generation instead of encoder-decoder?

  • A: Simpler architecture, easier to scale, and with enough data the decoder learns to "encode" implicitly. Also, causal masking naturally fits left-to-right generation.
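
A quick numeric check of the first answer above (random vectors, purely illustrative): the standard deviation of an unscaled dot product grows roughly like √d_k, which is what saturates the softmax.

# Dot products of random d_k-dim vectors have std ~ sqrt(d_k); dividing by sqrt(d_k) restores std ~ 1.
import torch

for d_k in (16, 256, 4096):
    q, k = torch.randn(1000, d_k), torch.randn(1000, d_k)
    dots = (q * k).sum(dim=-1)
    scaled = dots / d_k ** 0.5
    print(f"d_k={d_k:5d}  std(q·k)={dots.std().item():7.1f}  std after /sqrt(d_k)={scaled.std().item():.2f}")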

★ Code & Implementation

Load and Run a Transformer-Based LLM (HuggingFace)

# pip install transformers>=4.40 torch>=2.3
# ⚠️ Last tested: 2026-04 | Requires: transformers>=4.40, torch>=2.3
# CPU mode: runs slowly but works for learning. For GPU: set device_map="auto"

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-2b-it"  # ~5GB download; swap for any instruct model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",             # auto-distributes to available GPU/CPU
    attn_implementation="eager",   # use "flash_attention_2" on GPU with CUDA
)

prompt = [{"role": "user", "content": "Explain the transformer architecture in 3 sentences."}]
inputs = tokenizer.apply_chat_template(
    prompt, return_tensors="pt", add_generation_prompt=True
).to(model.device)

with torch.inference_mode():
    outputs = model.generate(
        inputs,
        max_new_tokens=150,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(response)

Minimal Transformer Block in PyTorch

# ⚠️ Last tested: 2026-04 | Requires: torch>=2.3
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerBlock(nn.Module):
    """Single transformer decoder block: MHA + FFN + residuals."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn  = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        # Causal mask: True marks future positions a token must not attend to
        causal_mask = torch.triu(torch.ones(seq_len, seq_len, device=x.device), diagonal=1).bool()
        # Pre-norm (modern style): normalize before each sub-layer, not after
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask)
        x = x + attn_out                   # residual connection around attention
        x = x + self.ffn(self.norm2(x))    # pre-norm FFN + residual
        return x

# Test:
block = TransformerBlock(d_model=64, n_heads=4, d_ff=256)
dummy = torch.randn(2, 10, 64)  # batch=2, seq_len=10, d_model=64
out = block(dummy)
print(f"Input shape: {dummy.shape} → Output shape: {out.shape}")  # Should match

★ Connections

| Relationship | Topics |
|---|---|
| Builds on | Neural Networks, Embeddings, Attention Mechanism |
| Leads to | Large Language Models (LLMs), Diffusion Models |
| Compare with | RNNs (sequential), LSTMs (gated sequential), CNNs (local patterns) |
| Cross-domain | Graph attention networks (GATs), Vision Transformers (ViT) |

◆ Production Failure Modes

| Failure | Symptoms | Root cause | Mitigation |
|---|---|---|---|
| Attention bottleneck | Inference latency grows quadratically with sequence length | O(n²) self-attention complexity | FlashAttention, sparse attention, SSM alternatives |
| Positional encoding limits | Quality degrades beyond the training context length | Fixed positional encodings don't extrapolate | RoPE with NTK scaling, ALiBi, position interpolation |
| KV-cache memory explosion | OOM during batch inference with long sequences | KV cache grows linearly per layer, per head, per token | GQA/MQA, KV-cache quantization, paged attention (vLLM) |

◆ Hands-On Exercises

Exercise 1: Implement Scaled Dot-Product Attention from Scratch

Goal: Build attention in pure PyTorch and verify it against the built-in.
Time: 30 minutes
Steps:
  1. Implement Q·K^T/√d_k → softmax → ·V in PyTorch
  2. Add a causal mask
  3. Compare the output against torch.nn.functional.scaled_dot_product_attention
  4. Verify the outputs match to 1e-5 tolerance
Expected output: Matching outputs and an attention-weight visualization
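
A possible harness for steps 3 and 4 (my_attention is a placeholder for your own step 1-2 implementation):

# Compare a from-scratch attention against PyTorch's built-in (exercise steps 3-4).
import torch
import torch.nn.functional as F

def my_attention(q, k, v):                        # placeholder: replace with your own implementation
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ v

q, k, v = (torch.randn(1, 8, 10, 64) for _ in range(3))     # (batch, heads, seq_len, d_head)
reference = F.scaled_dot_product_attention(q, k, v)
print(torch.allclose(my_attention(q, k, v), reference, atol=1e-5))   # expect: True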


◆ Learning Resources

| Type | Resource | Why |
|---|---|---|
| 📄 Paper | Vaswani et al., "Attention Is All You Need" (2017) | The foundational Transformer paper — read Sections 3-4 |
| 🎥 Video | 3Blue1Brown, "Attention in Transformers" | Best visual explanation of how attention works |
| 🎓 Course | Stanford CS224n: NLP with Deep Learning | Gold-standard NLP course covering Transformers in depth |
| 📘 Book | "Build a Large Language Model (From Scratch)" by Sebastian Raschka (2024), Ch. 3 | Step-by-step Transformer implementation in PyTorch |

★ Sources

  • Vaswani et al., "Attention Is All You Need" (2017) — https://arxiv.org/abs/1706.03762
  • "The Illustrated Transformer" by Jay Alammar — https://jalammar.github.io/illustrated-transformer/
  • Andrej Karpathy, "Let's build GPT from scratch" — YouTube lecture
  • "Formal Algorithms for Transformers" (Phuong & Hutter, 2022)