Modern LLM Architectures¶
✨ Bit: The original Transformer (2017) is like a Model T Ford. Every modern LLM has upgraded every component — MoE for efficiency, GQA for memory, RoPE for position, Flash Attention for speed. Same soul, completely different car.
★ TL;DR¶
- What: The architectural innovations that make modern LLMs (GPT-5, LLaMA 4, Gemini 3) work — beyond the basic Transformer
- Why: Interviewers ask "what's MoE?" and "how does RoPE work?" These are the building blocks of every frontier model.
- Key point: Modern LLMs aren't just bigger Transformers. They use MoE (activate only part of the model), GQA (save memory), RoPE (handle long sequences), and Flash Attention (go faster).
★ Overview¶
Definition¶
This document covers the key architectural components added to the basic Transformer to create modern LLMs. For the base Transformer architecture, see Transformers. For the attention mechanism, see Attention Mechanism.
Scope¶
Covers: MoE, GQA, RoPE, Flash Attention, normalization choices, and how they combine. Not a full paper review — focused on intuition and practical understanding.
Prerequisites¶
- Transformers — encoder/decoder, self-attention
- Attention Mechanism — Q, K, V matrices
- Linear Algebra For Ai — matrix operations
★ Deep Dive¶
What Changed from the Original Transformer¶
| Component | Original (2017) | Modern (2025) | Why |
|---|---|---|---|
| Experts | Dense (all params active) | MoE (sparse, subset active) | More capacity, less compute |
| Attention heads | Multi-Head (MHA) | GQA (grouped-query) | Less memory for KV cache |
| Position encoding | Sinusoidal (absolute) | RoPE (rotary) | Better at long sequences |
| Attention algorithm | Standard O(n²) | Flash Attention | 2-4x faster, less memory |
| Normalization | Post-LayerNorm | Pre-RMSNorm | Stable training |
| Activation | ReLU | SiLU/SwiGLU | Smoother, better performance |
1. Mixture of Experts (MoE)¶
DENSE MODEL (traditional):
Every token goes through ALL parameters.
LLaMA 70B: 70B params active per token → expensive!
MoE MODEL:
Each layer has N "expert" sub-networks.
A ROUTER decides which experts handle each token.
Only a few experts active per token (top-1 to top-8, out of anywhere from 8 to 256).
┌─────────────────────────────────────────────────┐
│ MoE LAYER │
│ │
│ Input token │
│ │ │
│ ▼ │
│ ┌────────┐ │
│ │ ROUTER │ ← "Which experts should handle │
│ │(Gating)│ this token?" │
│ └───┬────┘ │
│ │ top-k selection (usually k=2) │
│ ▼ │
│ ┌──────┐ ┌──────┐ ┌──────┐ ... ┌──────┐ │
│ │ E1 ✓ │ │ E2 │ │ E3 ✓ │ │ E16 │ │
│ │active │ │sleep │ │active│ │sleep │ │
│ └──┬───┘ └──────┘ └──┬───┘ └──────┘ │
│ │ │ │
│ ▼ ▼ │
│ weighted combination of active expert outputs │
│ │ │
│ ▼ │
│ Output token │
└─────────────────────────────────────────────────┘
RESULT (DeepSeek-V3 numbers from the table below):
Total params: 671B (huge capacity)
Active params per token: 37B (cheap to run!)
Best of both worlds: capacity of 671B, cost of ~37B
| Model | Total Params | Active Params | Experts | Top-K |
|---|---|---|---|---|
| Mixtral 8x7B | 47B | 13B | 8 | 2 |
| LLaMA 4 Scout | 109B | 17B | 16 | 1 |
| LLaMA 4 Maverick | 400B | 17B | 128 | 1 |
| DeepSeek-V3 | 671B | 37B | 256 | 8 |
| GPT-5 (estimated) | ~1T+ | ~200-300B | MoE | Unknown |
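To make the routing concrete, here is a minimal top-k MoE layer in PyTorch — a sketch with illustrative dimensions, not a production implementation (real systems add load-balancing losses, capacity factors, and batched per-expert dispatch):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MoELayer(nn.Module):
        def __init__(self, d=512, n_experts=16, top_k=2):
            super().__init__()
            self.top_k = top_k
            self.router = nn.Linear(d, n_experts)       # gating network
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d, 4 * d), nn.SiLU(), nn.Linear(4 * d, d))
                for _ in range(n_experts)
            )

        def forward(self, x):                           # x: (n_tokens, d)
            logits = self.router(x)                     # (n_tokens, n_experts)
            weights, idx = logits.topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)        # renormalize over the top-k
            out = torch.zeros_like(x)
            for slot in range(self.top_k):              # naive loop; real kernels batch by expert
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e            # tokens routed to expert e in this slot
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * expert(x[mask])
            return out

    layer = MoELayer()
    print(layer(torch.randn(10, 512)).shape)            # torch.Size([10, 512])

Note that compute per token scales with top_k, not with n_experts — that asymmetry is the whole trick.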
2. Grouped-Query Attention (GQA)¶
THE KV CACHE PROBLEM:
In attention: Q, K, V matrices.
During generation, K and V are CACHED for all past tokens.
Multi-Head Attention (MHA):
Each head has its own K and V matrices.
32 heads × 128 dim × 4096 seq × 2 (K+V) = HUGE memory!
SOLUTIONS (progressive):
MHA (Multi-Head Attention) — Original
┌───────────────────────────────────────┐
│ Q1 K1 V1 │ Q2 K2 V2 │ ... │ Q32 K32 V32
│ Each head has its own Q, K, V
│ KV cache: 32 × 2 = 64 matrices
└───────────────────────────────────────┘
MQA (Multi-Query Attention) — Extreme
┌───────────────────────────────────────┐
│ Q1 Q2 Q3 ... Q32 │ K V
│ All heads SHARE one K and one V
│ KV cache: 1 × 2 = 2 matrices (32x less!)
│ Problem: Quality drops
└───────────────────────────────────────┘
GQA (Grouped-Query Attention) — Sweet spot ✅
┌───────────────────────────────────────┐
│ Group 1: Q1 Q2 Q3 Q4 → K₁ V₁ │
│ Group 2: Q5 Q6 Q7 Q8 → K₂ V₂ │
│ ... │
│ Group 8: Q29..Q32 → K₈ V₈ │
│ KV cache: 8 × 2 = 16 matrices │
│ 4x less memory, near-MHA quality! │
└───────────────────────────────────────┘
Used in: LLaMA 2/3/4, Gemini, Mistral — virtually every modern LLM.
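The memory claim is easy to verify with back-of-the-envelope arithmetic. A sketch assuming a 70B-class configuration (80 layers, 64 heads of dim 128, fp16 cache, 4K-token sequence — illustrative numbers, not an exact model spec):

    def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per=2):
        # K and V each take: layers × kv_heads × head_dim × seq_len × bytes
        return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per

    gb = 1024 ** 3
    mha = kv_cache_bytes(80, 64, 128, 4096) / gb    # MHA: every head caches K, V
    gqa = kv_cache_bytes(80, 8, 128, 4096) / gb     # GQA: 8 KV-head groups
    mqa = kv_cache_bytes(80, 1, 128, 4096) / gb     # MQA: one shared KV head
    print(f"MHA {mha:.1f} GB | GQA {gqa:.2f} GB | MQA {mqa:.2f} GB")
    # → MHA 10.0 GB | GQA 1.25 GB | MQA 0.16 GB — per sequence, before batching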
3. RoPE (Rotary Position Embeddings)¶
THE POSITION PROBLEM:
Transformers process all tokens in parallel (no sequential order).
They need explicit position information: "this is token #5."
ORIGINAL: Sinusoidal (Absolute)
Add a fixed vector per position: pos_1, pos_2, ..., pos_512
❌ Generalizes poorly to sequences longer than the training length
❌ Doesn't capture relative distance well
RoPE: Rotary Position Embeddings
Instead of ADDING position info, ROTATE the Q and K vectors
by an angle proportional to their position.
Key insight: After rotation, the DOT PRODUCT of Q·K
naturally depends on RELATIVE position (distance between tokens).
Token at position 5: rotate Q by 5θ
Token at position 10: rotate Q by 10θ
Dot product captures: they're 5 positions apart
✅ Naturally handles relative positions
✅ Can be extended to longer sequences (NTK-aware scaling, YaRN)
✅ Computationally cheap (just rotation in pairs)
Used in: LLaMA 1/2/3/4, Mistral, Qwen, PaLM — standard in virtually all modern LLMs.
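A minimal sketch of the rotation itself, following the RoFormer formulation (base 10000 is the common default; sizes are illustrative). The check at the end confirms the key insight above: the Q·K score depends only on the distance between positions.

    import torch

    def rope(x, base=10000.0):
        """Rotate consecutive dim pairs of x (seq, dim) by position-dependent angles."""
        seq, dim = x.shape
        pos = torch.arange(seq, dtype=torch.float32)[:, None]        # (seq, 1)
        inv_freq = base ** (-torch.arange(0, dim, 2).float() / dim)  # (dim/2,)
        angles = pos * inv_freq                                      # (seq, dim/2)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[:, 0::2], x[:, 1::2]                              # split dims into pairs
        out = torch.empty_like(x)
        out[:, 0::2] = x1 * cos - x2 * sin                           # 2-D rotation per pair
        out[:, 1::2] = x1 * sin + x2 * cos
        return out

    q0, k0 = torch.randn(64), torch.randn(64)
    at = lambda v, p: rope(v.repeat(16, 1))[p]       # v as if it sat at position p
    s1 = at(q0, 0) @ at(k0, 5)                       # positions 0 and 5: distance 5
    s2 = at(q0, 10) @ at(k0, 15)                     # positions 10 and 15: distance 5
    print(torch.allclose(s1, s2, atol=1e-4))         # True — only distance matters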
4. Flash Attention¶
THE SPEED PROBLEM:
Standard attention: O(N²) in both time and memory.
4096 tokens → 16 million attention scores → slow + memory-heavy
FLASH ATTENTION SOLUTION:
Don't compute the full NxN attention matrix at once.
Instead, compute it in TILES (blocks) that fit in GPU SRAM.
GPU Memory Hierarchy (A100-class figures, per the FlashAttention paper):
┌────────────────────────────────────────┐
│ SRAM (on-chip cache) │ ~20 MB    │ ← FAST (~19 TB/s)
│ HBM (GPU RAM)        │ 40-80 GB  │ ← SLOW (~1.5-2 TB/s)
└────────────────────────────────────────┘
Standard attention: load full matrices from HBM → compute → store
Flash attention: compute in SRAM-sized tiles → never materialize
full attention matrix in HBM
Result:
2-4x faster
Linear memory instead of quadratic
Exact same output (not approximate!)
Versions: FlashAttention-1 (2022) → FlashAttention-2 (2023) → FlashAttention-3 (2024, Hopper GPUs)
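In practice you rarely call FlashAttention directly: PyTorch's F.scaled_dot_product_attention dispatches to a Flash kernel when hardware and dtype allow. A sketch that runs the Flash and naive backends and checks they agree (assumes a CUDA GPU, fp16, and torch ≥ 2.3 for torch.nn.attention):

    import torch
    import torch.nn.functional as F
    from torch.nn.attention import SDPBackend, sdpa_kernel

    q = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)
    k, v = torch.randn_like(q), torch.randn_like(q)

    with sdpa_kernel(SDPBackend.FLASH_ATTENTION):    # tiled; never materializes N×N
        out_flash = F.scaled_dot_product_attention(q, k, v, is_causal=True)

    with sdpa_kernel(SDPBackend.MATH):               # naive O(N²) reference
        out_math = F.scaled_dot_product_attention(q, k, v, is_causal=True)

    print(torch.allclose(out_flash, out_math, atol=1e-2))  # True: exact, not approximate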
5. Normalization & Activation¶
PRE-RMSNORM (replaces Post-LayerNorm):
Original Transformer: Attention → Add → LayerNorm → FFN → Add → LayerNorm
Modern LLM: RMSNorm → Attention → Add → RMSNorm → FFN → Add
RMSNorm = Root Mean Square Normalization
Simpler than LayerNorm (no mean subtraction), faster, works better.
SwiGLU ACTIVATION (replaces ReLU):
SwiGLU(x) = SiLU(x · W₁) ⊙ (x · V), then projected back down by W₂
Three matrices instead of two (W₁, V, W₂)
Better performance empirically; standard in LLaMA, Mistral, and most modern open models.
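Both pieces are only a few lines. A minimal sketch (the hidden size is illustrative; LLaMA-style models use roughly 8/3 · d to offset the cost of the third matrix):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RMSNorm(nn.Module):
        def __init__(self, d, eps=1e-6):
            super().__init__()
            self.weight = nn.Parameter(torch.ones(d))
            self.eps = eps

        def forward(self, x):
            # Unlike LayerNorm: no mean subtraction, just scale by the RMS.
            rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
            return x * rms * self.weight

    class SwiGLUFFN(nn.Module):
        def __init__(self, d=512, hidden=1408):         # hidden ≈ 8/3 · d, rounded
            super().__init__()
            self.w1 = nn.Linear(d, hidden, bias=False)  # gate projection
            self.v = nn.Linear(d, hidden, bias=False)   # up projection
            self.w2 = nn.Linear(hidden, d, bias=False)  # down projection

        def forward(self, x):
            return self.w2(F.silu(self.w1(x)) * self.v(x))  # SiLU(xW₁) ⊙ xV, then W₂

    print(SwiGLUFFN()(torch.randn(2, 512)).shape)       # torch.Size([2, 512])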
◆ How They Combine (LLaMA 3 Architecture)¶
LLaMA 3 70B — A complete modern LLM:
Input → Tokenizer (BPE, 128K vocab)
→ Embedding lookup
→ [80 Transformer layers, each:]
├── Pre-RMSNorm
├── GQA Self-Attention (8 KV heads, 64 query heads)
│ └── RoPE positional encoding
│ └── Flash Attention computation
├── Residual connection
├── Pre-RMSNorm
├── SwiGLU Feed-Forward Network
└── Residual connection
→ Final RMSNorm
→ Output head → Softmax → Next token probability
Total: 70B parameters, 8K-128K context
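Putting the pieces together as code: a compact, runnable sketch of one such block — Pre-RMSNorm, GQA (KV heads expanded with repeat_interleave, attention via SDPA), and a SwiGLU FFN. RoPE is omitted for brevity, dimensions are illustrative rather than LLaMA 3's real ones, and nn.RMSNorm requires torch ≥ 2.4:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DecoderBlock(nn.Module):
        def __init__(self, d=512, n_q=8, n_kv=2):
            super().__init__()
            self.n_q, self.n_kv, self.hd = n_q, n_kv, d // n_q
            self.norm1, self.norm2 = nn.RMSNorm(d), nn.RMSNorm(d)
            self.wq = nn.Linear(d, d, bias=False)
            self.wk = nn.Linear(d, n_kv * self.hd, bias=False)   # fewer KV heads
            self.wv = nn.Linear(d, n_kv * self.hd, bias=False)
            self.wo = nn.Linear(d, d, bias=False)
            self.w1 = nn.Linear(d, 4 * d, bias=False)            # SwiGLU gate
            self.w3 = nn.Linear(d, 4 * d, bias=False)            # SwiGLU up
            self.w2 = nn.Linear(4 * d, d, bias=False)            # SwiGLU down

        def forward(self, x):
            B, T, d = x.shape
            h = self.norm1(x)                                    # Pre-RMSNorm
            q = self.wq(h).view(B, T, self.n_q, self.hd).transpose(1, 2)
            k = self.wk(h).view(B, T, self.n_kv, self.hd).transpose(1, 2)
            v = self.wv(h).view(B, T, self.n_kv, self.hd).transpose(1, 2)
            # (RoPE would rotate q and k here, before scores are taken.)
            k = k.repeat_interleave(self.n_q // self.n_kv, dim=1)  # share KV per group
            v = v.repeat_interleave(self.n_q // self.n_kv, dim=1)
            a = F.scaled_dot_product_attention(q, k, v, is_causal=True)
            x = x + self.wo(a.transpose(1, 2).reshape(B, T, d))  # residual
            h = self.norm2(x)                                    # Pre-RMSNorm
            return x + self.w2(F.silu(self.w1(h)) * self.w3(h))  # SwiGLU FFN + residual

    print(DecoderBlock()(torch.randn(2, 16, 512)).shape)         # torch.Size([2, 16, 512])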
◆ Quick Reference¶
COMPONENT CHEAT SHEET:
MoE → More capacity, less compute (sparse activation)
GQA → Less KV cache memory (grouped key-value sharing)
RoPE → Better position encoding (rotation-based, extensible)
Flash → Faster attention (tiled SRAM computation)
RMSNorm → Simpler, faster normalization
SwiGLU → Better activation function
WHICH MODELS USE WHAT:
LLaMA 3: GQA + RoPE + Flash + RMSNorm + SwiGLU
LLaMA 4: MoE + GQA + RoPE + Flash + RMSNorm + SwiGLU
Mistral: GQA + RoPE + Flash + RMSNorm + SwiGLU + Sliding Window
Mixtral: MoE + GQA + RoPE + Flash + RMSNorm + SwiGLU
DeepSeek: MoE + MLA + RoPE + Flash + RMSNorm
GPT-5: MoE + proprietary attention + proprietary position
○ Gotchas & Common Mistakes¶
- ⚠️ MoE doesn't reduce model size: Total params are HUGE. MoE reduces ACTIVE params per token. You still need to fit ALL experts in memory.
- ⚠️ Flash Attention is exact: It's not an approximation. Same result as standard attention, just computed more efficiently.
- ⚠️ RoPE extension ≠ free: Extending context with RoPE scaling works but quality degrades beyond training length without fine-tuning.
- ⚠️ GQA grouping affects quality: Too few groups = quality drop. 8 groups for 64 heads is the typical sweet spot.
○ Interview Angles¶
- Q: What is Mixture of Experts and why does LLaMA 4 use it?
- A: MoE has multiple "expert" FFN sub-networks per layer with a learned router. For each token, only the top-K experts (e.g., 2 of 16) are activated. This gives the model the capacity of its total parameters at the computational cost of only the active ones. LLaMA 4 Maverick uses it to reach 400B total params with only 17B active — massive capacity at manageable cost.
- Q: What is GQA and how does it save memory?
- A: Grouped-Query Attention shares K and V heads across groups of Q heads. With 64 Q heads and 8 KV heads, the KV cache is 8x smaller than with full MHA. This is critical for serving long-context models — the KV cache can otherwise consume more memory than the model weights.
★ Code & Implementation¶
Compare SSM vs Transformer Inference Latency¶
# pip install torch>=2.3 mamba-ssm>=1.2 (mamba-ssm requires CUDA)
# ⚠️ Last tested: 2026-04 | Requires: torch>=2.3; mamba-ssm for the Mamba side
# For a CPU-only demo, run just the Transformer baseline below.
import time

import torch
import torch.nn as nn


def benchmark_inference(model, input_ids, n_runs: int = 10) -> float:
    """Return median inference latency in ms."""
    latencies = []
    with torch.inference_mode():
        for _ in range(n_runs):
            start = time.monotonic()
            model(input_ids)
            latencies.append((time.monotonic() - start) * 1000)
    latencies.sort()
    return latencies[len(latencies) // 2]  # median


# Transformer baseline. A decoder-only model is encoder layers plus a
# causal mask, so TransformerEncoderLayer is the right primitive here
# (TransformerDecoderLayer would add an unused cross-attention over
# dummy memory).
class MiniTransformer(nn.Module):
    def __init__(self, d=256, heads=4, layers=4, vocab=32000):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d, heads, batch_first=True)
            for _ in range(layers)
        ])
        self.head = nn.Linear(d, vocab)

    def forward(self, x):
        h = self.embed(x)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        for layer in self.layers:
            h = layer(h, src_mask=mask, is_causal=True)
        return self.head(h)


model = MiniTransformer().eval()
ids = torch.randint(0, 32000, (1, 512))
lat = benchmark_inference(model, ids)
print(f"MiniTransformer (512 tokens): {lat:.1f}ms median")
# Note: attention cost is quadratic in sequence length — try 1024 and
# 2048 tokens and watch latency grow superlinearly; that is the gap a
# linear-time SSM like Mamba closes.
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | Transformers, Attention Mechanism, Linear Algebra For Ai |
| Leads to | Llms Overview, Inference Optimization |
| Compare with | Original Transformer (2017), RNNs (sequential) |
| Cross-domain | Computer architecture (memory hierarchy), Sparse computation |
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Architecture-capability mismatch | Selected architecture underperforms on task type | Using encoder-only for generation, decoder-only for classification | Architecture selection guide: encoder for classification, decoder for generation |
| MoE routing collapse | Only 1-2 experts receive all tokens, others unused | Load balancing loss insufficient | Auxiliary load balancing loss, expert parallelism, capacity factors |
| Long-context degradation | Quality drops beyond pre-training context window | Architecture doesn't support position extrapolation | RoPE scaling, ALiBi, progressive context extension |
◆ Hands-On Exercises¶
Exercise 1: Compare Architecture Families on a Task¶
Goal: Run the same task through encoder-only, decoder-only, and encoder-decoder models
Time: 30 minutes
Steps:
1. Choose a summarization or classification task
2. Run with BERT (encoder), GPT-2 (decoder), T5 (enc-dec)
3. Compare output quality and inference speed
4. Document which architecture wins and why
Expected Output: Architecture comparison table with quality scores and latency
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 📄 Paper | Gu & Dao "Mamba: Linear-Time Sequence Modeling" (2023) | State-space models challenging transformers |
| 📄 Paper | Touvron et al. "LLaMA" (2023) | Open-weight LLM architecture decisions explained |
| 🎥 Video | Yannic Kilcher — Architecture Breakdowns | Detailed paper walkthroughs of modern architectures |
| 📘 Book | "Build a Large Language Model (From Scratch)" by Sebastian Raschka (2024) | End-to-end architecture implementation |
★ Sources¶
- Shazeer et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" (2017)
- Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" (2023)
- Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" (2021)
- Dao, "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (2022)
- Meta, "LLaMA 3 / LLaMA 4 Technical Reports" (2024-2025)
- Shazeer, "GLU Variants Improve Transformer" (SwiGLU, 2020)