Modern LLM Architectures

Bit: The original Transformer (2017) is like a Model T Ford. Every modern LLM has upgraded every component — MoE for efficiency, GQA for memory, RoPE for position, Flash Attention for speed. Same soul, completely different car.


★ TL;DR

  • What: The architectural innovations that make modern LLMs (GPT-5, LLaMA 4, Gemini 3) work — beyond the basic Transformer
  • Why: Interviewers ask "what's MoE?" and "how does RoPE work?" These are the building blocks of every frontier model.
  • Key point: Modern LLMs aren't just bigger Transformers. They use MoE (activate only part of the model), GQA (save memory), RoPE (handle long sequences), and Flash Attention (go faster).

★ Overview

Definition

This document covers the key architectural components added to the basic Transformer to create modern LLMs. For the base Transformer architecture, see Transformers. For the attention mechanism, see Attention Mechanism.

Scope

Covers: MoE, GQA, RoPE, Flash Attention, normalization choices, and how they combine. Not a full paper review — focused on intuition and practical understanding.

Prerequisites

Comfort with the base Transformer architecture and the attention mechanism (see the Transformers and Attention Mechanism docs referenced above).

★ Deep Dive

What Changed from the Original Transformer

Component            Original (2017)             Modern (2025)                Why
Experts              Dense (all params active)   MoE (sparse, subset active)  More capacity, less compute
Attention heads      Multi-Head (MHA)            GQA (grouped-query)          Less memory for KV cache
Position encoding    Sinusoidal (absolute)       RoPE (rotary)                Better at long sequences
Attention algorithm  Standard O(n²)              Flash Attention              2-4x faster, less memory
Normalization        Post-LayerNorm              Pre-RMSNorm                  Stable training
Activation           ReLU                        SiLU/SwiGLU                  Smoother, better performance

1. Mixture of Experts (MoE)

DENSE MODEL (traditional):
  Every token goes through ALL parameters.
  LLaMA 70B: 70B params active per token → expensive!

MoE MODEL:
  Each layer has N "expert" sub-networks.
  A ROUTER decides which experts handle each token.
  Only 2-4 experts active per token (out of 16+).

  ┌─────────────────────────────────────────────────┐
  │              MoE LAYER                           │
  │                                                 │
  │  Input token                                    │
  │      │                                          │
  │      ▼                                          │
  │  ┌────────┐                                     │
  │  │ ROUTER │  ← "Which experts should handle    │
  │  │(Gating)│     this token?"                    │
  │  └───┬────┘                                     │
  │      │  top-k selection (usually k=2)           │
  │      ▼                                          │
  │  ┌──────┐ ┌──────┐ ┌──────┐ ... ┌──────┐       │
  │  │ E1 ✓ │ │ E2   │ │ E3 ✓ │     │ E16  │       │
  │  │active │ │sleep │ │active│     │sleep │       │
  │  └──┬───┘ └──────┘ └──┬───┘     └──────┘       │
  │     │                  │                        │
  │     ▼                  ▼                        │
  │  weighted combination of active expert outputs  │
  │      │                                          │
  │      ▼                                          │
  │  Output token                                   │
  └─────────────────────────────────────────────────┘

RESULT (DeepSeek-V3 as the example):
  Total params: 671B (huge capacity)
  Active params per token: 37B (cheap to run!)
  Best of both worlds: capacity of 671B, cost of ~37B

Model              Total Params  Active Params  Experts  Top-K
Mixtral 8x7B       47B           13B            8        2
LLaMA 4 Scout      109B          17B            16       1
LLaMA 4 Maverick   400B          17B            128      1
DeepSeek-V3        671B          37B            256      8
GPT-5 (estimated)  ~1T+          ~200-300B      MoE      Unknown
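The routing scheme above can be sketched in a few lines of PyTorch. Everything here (class name, sizes, the per-expert loop) is illustrative only; real MoE layers use batched expert dispatch and a load-balancing loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy top-k MoE layer: a linear router picks k experts per token."""
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)   # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                             # x: (tokens, d_model)
        logits = self.router(x)                       # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)          # renormalize over top-k
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                # loops for clarity, not speed
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e              # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

moe = MoELayer()
y = moe(torch.randn(10, 64))
print(y.shape)  # torch.Size([10, 64])
```

Each token only runs through 2 of the 8 expert FFNs, which is exactly where the "capacity of all experts, compute of k experts" trade comes from.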

2. Grouped-Query Attention (GQA)

THE KV CACHE PROBLEM:
  In attention: Q, K, V matrices.
  During generation, K and V are CACHED for all past tokens.

  Multi-Head Attention (MHA):
    Each head has its own K and V matrices.
    32 heads × 128 dim × 4096 seq × 2 (K+V) = HUGE memory!
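The arithmetic above works out to gigabytes per sequence. A back-of-envelope check, assuming fp16 and a 32-layer model (the layer count is an assumption for illustration):

```python
# KV cache size for one sequence: heads × head_dim × seq_len × 2 (K and V)
# × 2 bytes (fp16) × layers. Layer count of 32 is an assumed illustration.
heads, head_dim, seq_len = 32, 128, 4096
bytes_fp16 = 2
per_layer = heads * head_dim * seq_len * 2 * bytes_fp16   # K and V per layer
n_layers = 32
total_bytes = per_layer * n_layers
print(f"{total_bytes / 2**30:.1f} GiB per sequence")  # 2.0 GiB per sequence
```

Multiply by the batch size and it is easy to see why the KV cache, not the weights, often dominates serving memory.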

SOLUTIONS (progressive):

  MHA (Multi-Head Attention) — Original
  ┌─────────────────────────────────────────────┐
  │ Q1 K1 V1 │ Q2 K2 V2 │ ... │ Q32 K32 V32     │
  │ Each head has its own Q, K, V               │
  │ KV cache: 32 × 2 = 64 matrices              │
  └─────────────────────────────────────────────┘

  MQA (Multi-Query Attention) — Extreme
  ┌─────────────────────────────────────────────┐
  │ Q1 Q2 Q3 ... Q32 │ one shared K, V          │
  │ All heads SHARE one K and one V             │
  │ KV cache: 1 × 2 = 2 matrices (32x less!)    │
  │ Problem: quality drops                      │
  └─────────────────────────────────────────────┘

  GQA (Grouped-Query Attention) — Sweet spot ✅
  ┌───────────────────────────────────────┐
  │  Group 1: Q1 Q2 Q3 Q4 → K₁ V₁       │
  │  Group 2: Q5 Q6 Q7 Q8 → K₂ V₂       │
  │  ...                                  │
  │  Group 8: Q29..Q32    → K₈ V₈        │
  │  KV cache: 8 × 2 = 16 matrices       │
  │  4x less memory, near-MHA quality!    │
  └───────────────────────────────────────┘

Used in: LLaMA 2/3/4, Gemini, Mistral — virtually every modern LLM.
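Mechanically, GQA is just attention where the cached K/V tensors have fewer heads, broadcast across each group of query heads at compute time. A minimal sketch (shapes are illustrative; `scaled_dot_product_attention` is the standard PyTorch attention API):

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v):
    """Grouped-query attention by expanding shared KV heads.
    q: (B, Hq, T, D); k, v: (B, Hkv, T, D) with Hq divisible by Hkv."""
    group_size = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(group_size, dim=1)  # broadcast KV to query heads
    v = v.repeat_interleave(group_size, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

q = torch.randn(1, 64, 16, 128)   # 64 query heads
k = torch.randn(1, 8, 16, 128)    # only 8 KV heads: 8x smaller KV cache
v = torch.randn(1, 8, 16, 128)
out = gqa_attention(q, k, v)
print(out.shape)  # torch.Size([1, 64, 16, 128])
```

Only the 8-head K/V tensors ever need to be cached; the expansion happens on the fly during the attention computation.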

3. RoPE (Rotary Position Embeddings)

THE POSITION PROBLEM:
  Transformers process all tokens in parallel (no sequential order).
  They need explicit position information: "this is token #5."

ORIGINAL: Sinusoidal (Absolute)
  Add a fixed vector per position: pos_1, pos_2, ..., pos_512
  ❌ Can't handle sequences longer than training length
  ❌ Doesn't capture relative distance well

RoPE: Rotary Position Embeddings
  Instead of ADDING position info, ROTATE the Q and K vectors
  by an angle proportional to their position.

  Key insight: After rotation, the DOT PRODUCT of Q·K
  naturally depends on RELATIVE position (distance between tokens).

  Token at position 5:  rotate Q by 5θ
  Token at position 10: rotate Q by 10θ
  Dot product captures: they're 5 positions apart

  ✅ Naturally handles relative positions
  ✅ Can be extended to longer sequences (NTK-aware scaling, YaRN)
  ✅ Computationally cheap (just rotation in pairs)

Used in: LLaMA 1/2/3/4, Mistral, Qwen, PaLM — standard in virtually all modern LLMs.
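The relative-position property is easy to verify numerically. Below is a minimal RoPE sketch (the pairwise rotation layout is one common convention; production implementations differ in details) showing that the Q·K score depends only on the distance between positions:

```python
import torch

def rope(x, base=10000.0):
    """Apply rotary embeddings to x of shape (seq, d): rotate each
    consecutive feature pair by an angle proportional to its position."""
    seq, d = x.shape
    pos = torch.arange(seq, dtype=torch.float32)[:, None]              # (seq, 1)
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)  # (d/2,)
    cos, sin = (pos * freqs).cos(), (pos * freqs).sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin   # 2-D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def at(v, p):
    """Rotate a single vector v as if it sat at position p."""
    x = torch.zeros(p + 1, v.shape[0])
    x[p] = v
    return rope(x)[p]

d = 8
qv, kv = torch.randn(d), torch.randn(d)
a = at(qv, 5) @ at(kv, 3)    # query at 5, key at 3  (distance 2)
b = at(qv, 10) @ at(kv, 8)   # query at 10, key at 8 (distance 2)
print(torch.allclose(a, b, atol=1e-4))  # True: score depends only on distance
```

Shifting both tokens by the same offset leaves the dot product unchanged, which is the "naturally relative" property the prose above describes.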

4. Flash Attention

THE SPEED PROBLEM:
  Standard attention: O(N²) in both time and memory.
  4096 tokens → 16 million attention scores → slow + memory-heavy

FLASH ATTENTION SOLUTION:
  Don't compute the full NxN attention matrix at once.
  Instead, compute it in TILES (blocks) that fit in GPU SRAM.

  GPU Memory Hierarchy:
  ┌────────────────────────────────────────┐
  │  SRAM (on-chip cache)  │  20 MB       │ ← FAST (10 TB/s)
  │  HBM (GPU RAM)         │  40-80 GB    │ ← SLOW (2 TB/s)
  └────────────────────────────────────────┘

  Standard attention: load full matrices from HBM → compute → store
  Flash attention:    compute in SRAM-sized tiles → never materialize
                      full attention matrix in HBM

  Result:
    2-4x faster
    Linear memory instead of quadratic
    Exact same output (not approximate!)

Versions: FlashAttention-1 (2022) → FlashAttention-2 (2023) → FlashAttention-3 (2024, Hopper GPUs)
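In PyTorch you rarely call Flash Attention directly: `torch.nn.functional.scaled_dot_product_attention` dispatches to a FlashAttention kernel when one is available (CUDA, supported dtypes) and falls back to a standard implementation otherwise. Either path computes the same exact result, which the sketch below checks against a naive version that materializes the full score matrix:

```python
import torch
import torch.nn.functional as F

B, H, T, D = 1, 8, 256, 64
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))

# Fused path: never materializes the T×T score matrix in global memory
fast = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Naive reference: explicitly build the full T×T causal attention matrix
scores = (q @ k.transpose(-2, -1)) / (D ** 0.5)
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal, float("-inf"))
ref = scores.softmax(dim=-1) @ v

print(f"max |diff| = {(fast - ref).abs().max():.2e}")  # tiny float noise only
```

The outputs agree to floating-point precision: Flash Attention changes how the computation is scheduled (tiles in SRAM, online softmax), not what it computes.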

5. Normalization & Activation

PRE-RMSNORM (replaces Post-LayerNorm):
  Original Transformer: Attention → Add → LayerNorm → FFN → Add → LayerNorm
  Modern LLM:           RMSNorm → Attention → Add → RMSNorm → FFN → Add

  RMSNorm = Root Mean Square Normalization
  Simpler than LayerNorm (no mean subtraction), faster, works better.

SwiGLU ACTIVATION (replaces ReLU):
  SwiGLU(x) = SiLU(x · W₁) ⊙ (x · V), then projected back down with W₂

  Three matrices instead of the usual FFN's two (W₁, V, W₂).
  Empirically better performance; standard in LLaMA, Mistral, and PaLM.
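Both pieces fit in a few lines. A minimal sketch in PyTorch (the dimensions and 4x hidden expansion are illustrative choices, not any particular model's config):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Normalize by root-mean-square only: no mean subtraction, no bias."""
    def __init__(self, d, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d))  # learned per-channel scale
        self.eps = eps

    def forward(self, x):
        inv_rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight

class SwiGLU(nn.Module):
    """SwiGLU FFN: gate = SiLU(x·W1), value = x·V, out = (gate * value)·W2."""
    def __init__(self, d, hidden):
        super().__init__()
        self.w1 = nn.Linear(d, hidden, bias=False)  # gate projection
        self.v = nn.Linear(d, hidden, bias=False)   # value projection
        self.w2 = nn.Linear(hidden, d, bias=False)  # down projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.v(x))

x = torch.randn(4, 256)
y = SwiGLU(256, 1024)(RMSNorm(256)(x))
print(y.shape)  # torch.Size([4, 256])
```

Note the element-wise gating: the SiLU branch decides how much of each value-branch channel passes through, which is what distinguishes a GLU-style FFN from a plain two-matrix one.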

◆ How They Combine (LLaMA 3 Architecture)

LLaMA 3 70B — A complete modern LLM:

  Input → Tokenizer (BPE, 128K vocab)
       → Embedding lookup
       → [80 Transformer layers, each:]
           ├── Pre-RMSNorm
           ├── GQA Self-Attention (8 KV heads, 64 query heads)
           │   └── RoPE positional encoding
           │   └── Flash Attention computation
           ├── Residual connection
           ├── Pre-RMSNorm
           ├── SwiGLU Feed-Forward Network
           └── Residual connection
       → Final RMSNorm
       → Output head → Softmax → Next token probability

  Total: 70B parameters, 8K-128K context
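The layer diagram above can be assembled into runnable code. A heavily simplified sketch of one pre-norm block in this style (tiny dimensions, learned norm scales and RoPE omitted for brevity; this is an illustration of the wiring, not LLaMA 3 itself):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rms_norm(x, eps=1e-6):
    # RMSNorm without the learned scale, for brevity
    return x * x.pow(2).mean(-1, keepdim=True).add(eps).rsqrt()

class LlamaStyleBlock(nn.Module):
    """Pre-norm decoder layer: RMSNorm -> GQA attention -> residual,
    then RMSNorm -> SwiGLU FFN -> residual. RoPE omitted for brevity."""
    def __init__(self, d=512, n_q=8, n_kv=2):
        super().__init__()
        self.n_q, self.n_kv, self.hd = n_q, n_kv, d // n_q
        self.wq = nn.Linear(d, n_q * self.hd, bias=False)
        self.wk = nn.Linear(d, n_kv * self.hd, bias=False)   # fewer KV heads
        self.wv = nn.Linear(d, n_kv * self.hd, bias=False)
        self.wo = nn.Linear(d, d, bias=False)
        h = 4 * d
        self.w1 = nn.Linear(d, h, bias=False)                # SwiGLU gate
        self.v = nn.Linear(d, h, bias=False)                 # SwiGLU value
        self.w2 = nn.Linear(h, d, bias=False)                # SwiGLU down

    def forward(self, x):                                    # x: (B, T, d)
        B, T, _ = x.shape
        h = rms_norm(x)                                      # pre-norm
        q = self.wq(h).view(B, T, self.n_q, self.hd).transpose(1, 2)
        k = self.wk(h).view(B, T, self.n_kv, self.hd).transpose(1, 2)
        v = self.wv(h).view(B, T, self.n_kv, self.hd).transpose(1, 2)
        rep = self.n_q // self.n_kv                          # GQA group size
        k, v = k.repeat_interleave(rep, 1), v.repeat_interleave(rep, 1)
        a = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.wo(a.transpose(1, 2).reshape(B, T, -1))  # residual 1
        h = rms_norm(x)                                       # pre-norm
        return x + self.w2(F.silu(self.w1(h)) * self.v(h))    # residual 2

out = LlamaStyleBlock()(torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```

Stack 80 of these (with an embedding, RoPE on Q/K, a final norm, and an output head) and you have the skeleton of the architecture sketched above.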

◆ Quick Reference

COMPONENT CHEAT SHEET:
  MoE      → More capacity, less compute (sparse activation)
  GQA      → Less KV cache memory (grouped key-value sharing)
  RoPE     → Better position encoding (rotation-based, extensible)
  Flash    → Faster attention (tiled SRAM computation)
  RMSNorm  → Simpler, faster normalization
  SwiGLU   → Better activation function

WHICH MODELS USE WHAT:
  LLaMA 3:    GQA + RoPE + Flash + RMSNorm + SwiGLU
  LLaMA 4:    MoE + GQA + RoPE + Flash + RMSNorm + SwiGLU
  Mistral:    GQA + RoPE + Flash + RMSNorm + SwiGLU + Sliding Window
  Mixtral:    MoE + GQA + RoPE + Flash + RMSNorm + SwiGLU
  DeepSeek:   MoE + MLA + RoPE + Flash + RMSNorm
  GPT-5:      MoE + proprietary attention + proprietary position

○ Gotchas & Common Mistakes

  • ⚠️ MoE doesn't reduce model size: Total params are HUGE. MoE reduces ACTIVE params per token. You still need to fit ALL experts in memory.
  • ⚠️ Flash Attention is exact: It's not an approximation. Same result as standard attention, just computed more efficiently.
  • ⚠️ RoPE extension ≠ free: Extending context with RoPE scaling works but quality degrades beyond training length without fine-tuning.
  • ⚠️ GQA grouping affects quality: Too few groups = quality drop. 8 groups for 64 heads is the typical sweet spot.

○ Interview Angles

  • Q: What is Mixture of Experts and why does LLaMA 4 use it?
  • A: MoE has multiple "expert" FFN sub-networks per layer with a learned router. For each token, only top-K experts (e.g., 2 of 16) are activated. This gives the model capacity of the total parameters but computational cost of only the active experts. LLaMA 4 uses it to achieve 400B total params with only 17B active — massive capacity at manageable cost.

  • Q: What is GQA and how does it save memory?

  • A: Grouped-Query Attention shares K and V heads across groups of Q heads. With 64 Q heads and 8 KV heads, the KV cache is 8x smaller than full MHA. This is critical for serving long-context models — KV cache can otherwise consume more memory than the model weights.

★ Code & Implementation

Benchmark a Minimal Transformer Baseline

# pip install torch>=2.3
# ⚠️ Last tested: 2026-04 | Requires: torch>=2.3
# CPU-only is fine: this measures toy-model latency, not trained-model quality

import torch, time

def benchmark_inference(model, input_ids, n_runs: int = 10) -> float:
    """Return median inference latency in ms."""
    latencies = []
    with torch.inference_mode():
        for _ in range(n_runs):
            start = time.monotonic()
            model(input_ids)
            latencies.append((time.monotonic() - start) * 1000)
    latencies.sort()
    return latencies[len(latencies) // 2]  # median

# Transformer baseline (decoder-only, minimal)
import torch.nn as nn

class MiniTransformer(nn.Module):
    def __init__(self, d=256, heads=4, layers=4, seq=512):
        super().__init__()
        self.embed = nn.Embedding(32000, d)
        self.layers = nn.ModuleList([
            nn.TransformerDecoderLayer(d, heads, batch_first=True)
            for _ in range(layers)
        ])
        self.head = nn.Linear(d, 32000)

    def forward(self, x):
        h = self.embed(x)
        mem = torch.zeros_like(h)  # dummy encoder memory (decoder-layer API requires one)
        for layer in self.layers:
            h = layer(h, mem)      # no causal mask; fine for a latency benchmark
        return self.head(h)

model = MiniTransformer()
ids   = torch.randint(0, 32000, (1, 512))
lat   = benchmark_inference(model, ids)
print(f"MiniTransformer (512 tokens): {lat:.1f}ms median")
# Note: quadratic scaling — try seq=1024, 2048 to see latency grow

★ Connections

Relationship   Topics
Builds on      Transformers, Attention Mechanism, Linear Algebra For Ai
Leads to       Llms Overview, Inference Optimization
Compare with   Original Transformer (2017), RNNs (sequential)
Cross-domain   Computer architecture (memory hierarchy), sparse computation

◆ Production Failure Modes

Failure: Architecture-capability mismatch
  Symptoms:   Selected architecture underperforms on the task type
  Root cause: Using encoder-only for generation, or decoder-only for classification
  Mitigation: Choose by task: encoder for classification, decoder for generation

Failure: MoE routing collapse
  Symptoms:   Only 1-2 experts receive all tokens; others go unused
  Root cause: Insufficient load-balancing loss
  Mitigation: Auxiliary load-balancing loss, expert parallelism, capacity factors

Failure: Long-context degradation
  Symptoms:   Quality drops beyond the pre-training context window
  Root cause: Architecture doesn't support position extrapolation
  Mitigation: RoPE scaling, ALiBi, progressive context extension

◆ Hands-On Exercises

Exercise 1: Compare Architecture Families on a Task

Goal: Run the same task through encoder-only, decoder-only, and encoder-decoder models
Time: 30 minutes
Steps:
  1. Choose a summarization or classification task
  2. Run with BERT (encoder), GPT-2 (decoder), T5 (enc-dec)
  3. Compare output quality and inference speed
  4. Document which architecture wins and why
Expected Output: Architecture comparison table with quality scores and latency


◆ Learning Resources

Type      Resource                                                                  Why
📄 Paper  Gu & Dao, "Mamba: Linear-Time Sequence Modeling" (2023)                   State-space models challenging Transformers
📄 Paper  Touvron et al., "LLaMA" (2023)                                            Open-weight LLM architecture decisions explained
🎥 Video  Yannic Kilcher — Architecture Breakdowns                                  Detailed paper walkthroughs of modern architectures
📘 Book   "Build a Large Language Model (From Scratch)", Sebastian Raschka (2024)   End-to-end architecture implementation

★ Sources

  • Shazeer et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" (2017)
  • Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" (2023)
  • Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" (2021)
  • Dao, "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (2022)
  • Meta, "LLaMA 3 / LLaMA 4 Technical Reports" (2024-2025)
  • Shazeer, "GLU Variants Improve Transformer" (SwiGLU, 2020)