Inference Optimization¶
✨ Bit: Training a frontier LLM costs $100M+. Running it costs... well, also a LOT. Inference optimization is the difference between "cool demo" and "sustainable business." This is the deep tech that companies actually pay for.
★ TL;DR¶
- What: Techniques to make LLM inference faster, cheaper, and more memory-efficient without (significantly) hurting quality
- Why: Inference is where the money is spent (90%+ of LLM compute cost in production). This is THE skill for deep tech roles.
- Key point: Quantization (smaller numbers) + KV caching (don't recompute) + Speculative decoding (predict + verify) = orders of magnitude improvement.
★ Overview¶
Definition¶
Inference optimization covers all techniques that reduce the latency, memory, cost, or compute required to generate outputs from a trained LLM. Unlike training optimization (done once), inference optimization impacts every single request forever.
Scope¶
Covers: Quantization, KV cache, speculative decoding, batching, and architectural optimizations. For serving infrastructure, see Model Serving for LLM Applications and GenAI Tools & Infrastructure. For hardware foundations, see GPU & CUDA Programming for AI Engineers. For scaled serving topologies, see Distributed Inference & Serving Architecture.
Significance¶
- At scale, inference cost > training cost (by 3-10x)
- The reason you can run LLaMA 70B on a single GPU (quantization)
- The reason vLLM serves 10x more requests than naive serving (PagedAttention)
- Deep tech roles require this knowledge: not everyone who uses LLMs understands HOW to make them fast
Prerequisites¶
- Transformers — attention mechanism and KV computation
- LLMs Overview — how generation works (autoregressive)
- Basic GPU/memory concepts
★ Deep Dive¶
Why Inference Is Slow¶
AUTOREGRESSIVE GENERATION IS SEQUENTIAL:
"The" → "capital" → "of" → "France" → "is" → "Paris" → "."
Each token requires a FULL forward pass through the model.
GPT-5.4 (frontier scale): Each forward pass = massive computation.
100-token response = 100 sequential forward passes.
TWO PHASES:
┌──────────────────────────────────────────────────────┐
│ PREFILL (process input) │
│ All input tokens processed in parallel. │
│ Compute-bound (lots of matrix math). │
│ One-time cost per request. │
├──────────────────────────────────────────────────────┤
│ DECODE (generate output) │
│ One token at a time, sequentially. │
│ Memory-bound (loading model weights from GPU RAM). │
│ Repeated for every output token. │
│ THIS IS THE BOTTLENECK. │
└──────────────────────────────────────────────────────┘
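A rough way to see the two phases on real hardware (a minimal sketch; assumes a CUDA GPU, and uses google/gemma-2-2b-it purely as a small example model):
# Measure time-to-first-token (≈ prefill) vs per-token decode cost
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
name = "google/gemma-2-2b-it"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16, device_map="auto")
inputs = tok("Explain why decoding is memory-bound. " * 20, return_tensors="pt").to(model.device)
def timed_generate(n_new):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    model.generate(**inputs, max_new_tokens=n_new, do_sample=False)
    torch.cuda.synchronize()
    return time.perf_counter() - t0
timed_generate(1)                # warm-up (kernel compilation, caches)
t_first = timed_generate(1)      # ≈ prefill cost + one decode step (time-to-first-token)
t_many = timed_generate(101)     # prefill + 101 sequential decode steps
print(f"prefill (TTFT): {t_first*1e3:.0f} ms for {inputs.input_ids.shape[1]} input tokens")
print(f"decode: {(t_many - t_first)/100*1e3:.1f} ms per output token (sequential)")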
The Big Three Techniques¶
1. Quantization (Smaller Numbers = Less Memory)¶
CONCEPT:
FP32: 11000001 01001000 00000000 00000000 (32 bits per number)
FP16: 11000010 01001000 (16 bits — half the memory!)
INT8: 11001010 (8 bits — quarter the memory!)
INT4: 1100 (4 bits — 1/8 the memory!)
MEMORY IMPACT (LLaMA 70B):
FP32: 280 GB (4× A100 80GB)
FP16: 140 GB (2× A100 80GB)
INT8: 70 GB (1× A100 80GB)
INT4: 35 GB (1× A100 40GB or a 48 GB workstation GPU!) ← This is why quantization matters
| Method | Type | How It Works | Quality Loss |
|---|---|---|---|
| GPTQ | PTQ (post-training) | Layer-by-layer quantization using calibration data | Low (< 1%) |
| AWQ | PTQ | Protects "salient" weights from quantization | Very low |
| GGUF | PTQ | CPU-friendly format used by llama.cpp | Varies by bits |
| QLoRA | QAT (quantize + train) | 4-bit base + LoRA adapters in 16-bit | Minimal |
| FP8 | Native hardware | Native support on Hopper-class and newer GPUs (H100, Blackwell, Rubin) | Very low |
QUICK DECISION:
Running locally (consumer GPU)? → GGUF Q4 via Ollama/llama.cpp
Production serving on GPU? → AWQ or GPTQ via vLLM
Fine-tuning on limited GPU? → QLoRA (4-bit base + 16-bit LoRA)
Newest enterprise GPU (Blackwell)? → Native FP8
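To make the "smaller numbers" idea concrete, here is a toy symmetric absmax INT8 quantizer for a single weight tensor. It is a sketch of the principle only, not a production method like GPTQ/AWQ (which use calibration data and per-group scales):
# Toy INT8 quantization of one weight matrix
import torch
w = torch.randn(4096, 4096)                                           # FP32 weights: 4 bytes each
scale = w.abs().max() / 127.0                                         # map the largest magnitude to the INT8 range
w_int8 = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)   # 1 byte each
w_restored = w_int8.to(torch.float32) * scale                         # dequantize at compute time
print(f"FP32: {w.numel() * 4 / 1e6:.0f} MB -> INT8: {w_int8.numel() / 1e6:.0f} MB")
print(f"mean abs error: {(w - w_restored).abs().mean().item():.5f}")
# Real methods (GPTQ/AWQ) keep error low even at 4 bits by using per-group
# scales and calibration data to protect the weights that matter most.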
2. KV Cache (Don't Recompute Past Tokens)¶
WITHOUT KV CACHE:
Generate "Paris": Compute attention for ["The", "capital", "of", "France", "is"]
Generate ".": Compute attention for ["The", "capital", "of", "France", "is", "Paris"]
↑ Recomputed everything again! Wasteful.
WITH KV CACHE:
Generate "Paris": Compute K,V for all tokens → STORE in cache
Generate ".": Reuse cached K,V → Only compute for new token "Paris"
↑ 100x faster for long sequences!
PROBLEM: KV cache grows with sequence length × batch size:
LLaMA 70B, 4096 context, batch=32:
KV cache = ~40 GB of GPU memory just for the cache!
| KV Cache Technique | What It Does | Impact |
|---|---|---|
| PagedAttention (vLLM) | Paging like OS memory management | 2-4x more throughput |
| KV Cache Quantization | Compress cache to FP8/INT4 | 50-75% less cache memory |
| Prefix Caching | Share cache for common prefixes | Fewer recomputations |
| Sliding Window | Only cache recent N tokens | Bounded memory (Mistral) |
| Token Pruning/Eviction | Remove less important cached tokens | More capacity per request |
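A quick way to feel what the cache buys you is to generate with caching disabled (a minimal sketch; assumes a CUDA GPU and uses google/gemma-2-2b-it purely as an example model):
# Same generation with and without the KV cache
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
name = "google/gemma-2-2b-it"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16, device_map="auto")
inputs = tok("The capital of France is", return_tensors="pt").to(model.device)
for use_cache in (True, False):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    model.generate(**inputs, max_new_tokens=256, do_sample=False, use_cache=use_cache)
    torch.cuda.synchronize()
    print(f"use_cache={use_cache}: {time.perf_counter() - t0:.2f}s for 256 tokens")
# With the cache off, every step re-encodes the whole prefix (cost grows with
# output length); with the cache on, each step only computes K/V for the newest token.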
3. Speculative Decoding (Predict + Verify in Parallel)¶
NORMAL DECODING (slow):
Big Model generates: T1 → T2 → T3 → T4 → T5
Time: 5 sequential forward passes through the BIG model
SPECULATIVE DECODING (fast):
Step 1: Small draft model rapidly generates 5 candidate tokens:
T1, T2, T3, T4, T5 (5 fast forward passes)
Step 2: Big model verifies ALL 5 in ONE parallel forward pass:
"T1 ✓, T2 ✓, T3 ✓, T4 ✗ (wrong), T5 —"
Step 3: Accept T1, T2, T3. Regenerate from T4.
Result: 3 tokens accepted from 1 big-model pass instead of 3 sequential passes (the verify pass also yields the target's own token at position 4).
Speedup: 2-3x with ZERO quality loss (mathematically proven).
┌──────────┐ ┌──────────┐ ┌──────────────┐
│ Draft │───►│ Target │───►│ Accept/Reject│
│ Model │ │ Model │ │ verified │
│ (fast) │ │ (accurate)│ │ tokens │
└──────────┘ └──────────┘ └──────────────┘
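The accept/verify loop fits in a few lines. Below is a toy greedy variant, a sketch only (production systems use rejection sampling to exactly match the target's sampling distribution); gpt2 drafts for gpt2-large purely because they share a tokenizer and download quickly:
# Toy greedy speculative decoding: small drafter, one verify pass per step
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("gpt2")
draft = AutoModelForCausalLM.from_pretrained("gpt2").eval()           # small, fast drafter
target = AutoModelForCausalLM.from_pretrained("gpt2-large").eval()    # large verifier
ids = tok("The capital of France is", return_tensors="pt").input_ids
k = 5  # draft tokens per step
with torch.no_grad():
    for _ in range(4):  # a few speculative steps
        # 1) DRAFT: small model proposes k tokens (greedy)
        drafted = draft.generate(ids, max_new_tokens=k, do_sample=False,
                                 pad_token_id=tok.eos_token_id)
        proposals = drafted[0, ids.shape[1]:]
        # 2) VERIFY: ONE target forward pass scores every proposed position
        logits = target(drafted).logits
        preds = logits[0, ids.shape[1] - 1 : -1].argmax(-1)  # target's greedy choice per position
        # 3) ACCEPT until the first disagreement, then take the target's own token,
        #    so every step commits at least one target-quality token
        n = 0
        while n < k and proposals[n] == preds[n]:
            n += 1
        bonus = preds[n] if n < k else logits[0, -1].argmax()
        new_ids = torch.cat([proposals[:n], bonus.view(1)])
        ids = torch.cat([ids, new_ids.unsqueeze(0)], dim=1)
print(tok.decode(ids[0]))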
Speculative Decoding Techniques (2025-2026):
| Technique | Method | Training Required | Typical Speedup | Deployment Maturity |
|---|---|---|---|---|
| EAGLE-3 | Draft head (2-5% of target model params), feature fusion, tree-based candidate verification | Yes (lightweight head training) | 2-3x | Production (vLLM, SGLang) |
| Medusa | Multiple prediction heads on target model, each predicts k tokens ahead | Yes (head fine-tuning) | 1.5-2.5x | Production |
| SPECTRA | Training-free, uses n-gram + small model ensemble for drafting | No | 1.5-2x | Research → Production |
| Self-Speculative | Model drafts from its own early layers (early exit) | No | 1.3-1.8x | Experimental |
ENABLING SPECULATIVE DECODING IN VLLM:
# Start vLLM with speculative decoding enabled
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.2-70B-Instruct \
--speculative-model meta-llama/Llama-3.2-8B-Instruct \
--num-speculative-tokens 5 \
--use-v2-block-manager
# Key flags:
# --speculative-model Draft model (smaller, same tokenizer)
# --num-speculative-tokens How many tokens to draft per step (3-7)
# --speculative-draft-tensor-parallel-size TP for draft model
# Newer vLLM releases consolidate these into a single --speculative-config JSON flag; check --help for your version.
# SGLang equivalent (flag names vary by SGLang version; check --help):
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.2-70B-Instruct \
--speculative-algorithm EAGLE \
--speculative-draft-model-path eagle-head-weights/
Edge & On-Device Inference¶
Running LLMs on consumer hardware, mobile devices, and edge GPUs.
| Stack | Hardware | Best For | Key Feature |
|---|---|---|---|
| Ollama | macOS/Linux/Windows | Local LLM experimentation | One-command setup, model registry |
| MLX | Apple Silicon (M1-M4) | macOS-native inference | UMA advantage, Metal acceleration |
| llama.cpp / GGUF | CPU + optional GPU | Universal local inference | Pure C++, no Python dependencies |
| ExecuTorch | iOS / Android | Mobile deployment | PyTorch mobile runtime |
EDGE MODEL SELECTION (April 2026):
| Device Memory | Recommended Models | Quantization |
|---|---|---|
| 4 GB | Gemma 4 E2B (2B), Phi-4-mini | Q4_K_M |
| 8 GB | Gemma 4 E4B (4B), LLaMA 3.2 3B | Q4_K_M / Q5_K_M |
| 16 GB | Gemma 4 26B MoE, Mistral 7B | Q4_K_M |
| 32 GB | LLaMA 3.2 8B, Gemma 4 31B | Q5_K_M / Q6_K |
| 64 GB+ | LLaMA 3.2 70B | Q4_K_M |
APPLE SILICON UMA ADVANTAGE:
CPU and GPU share the same memory pool (Unified Memory Architecture)
No PCIe transfer bottleneck → faster KV cache access
M4 Max (128GB) can run 70B models at Q4 comfortably
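A typical local workflow, sketched with Ollama's OpenAI-compatible endpoint; it assumes Ollama is installed, a model has been pulled (the llama3.2:3b tag is just an example), and the default port 11434:
# Prereq (terminal): ollama pull llama3.2:3b   # pick a model/quant that fits your RAM
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored locally
resp = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Why does Q4_K_M quantization let a 3B model run on a laptop?"}],
)
print(resp.choices[0].message.content)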
Other Key Techniques¶
| Technique | What | Impact |
|---|---|---|
| Continuous Batching | Dynamically add/remove requests from a batch | Better GPU utilization |
| Tensor Parallelism | Split model across multiple GPUs | Run models too large for 1 GPU |
| Pipeline Parallelism | Different layers on different GPUs | Reduce per-GPU memory |
| Flash Attention | Tiled attention computation | 2-4x faster attention |
| Knowledge Distillation | Train smaller model to mimic larger | Smaller, faster model |
| Pruning | Remove unimportant weights | Smaller model, some quality loss |
| MoE (Mixture of Experts) | Only activate subset of params per token | More capacity, less compute |
The Optimization Stack (Layer Them)¶
LEVEL 1: Architecture (MoE, GQA, Flash Attention) ← Built into model
LEVEL 2: Quantization (INT4/INT8/FP8) ← Compress model
LEVEL 3: Serving Engine (vLLM, TGI, SGLang) ← Efficient serving
LEVEL 4: KV Cache Optimization (PagedAttention, compression)← Memory management
LEVEL 5: Speculative Decoding (draft + verify) ← Speed up generation
LEVEL 6: Batching (continuous/dynamic) ← Maximize throughput
LEVEL 7: Hardware (Blackwell GPUs, custom ASICs) ← Raw performance
Stack them: 4-bit quant + vLLM + speculative decoding
= 10x+ cheaper than naive FP16 serving
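A minimal sketch of stacking levels 2-4 and 6 in a few lines via vLLM's offline API: load a pre-quantized AWQ checkpoint and let the engine supply PagedAttention and continuous batching. The checkpoint name below is only an example of an AWQ repo; substitute any AWQ-quantized model with a supported architecture.
from vllm import LLM, SamplingParams
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",  # example pre-quantized AWQ checkpoint
    quantization="awq",               # level 2: 4-bit weights
    gpu_memory_utilization=0.90,      # leave headroom for the KV cache
    max_model_len=4096,
)
outputs = llm.generate(
    ["Explain KV caching in one sentence.", "What is speculative decoding?"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
for out in outputs:
    print(out.outputs[0].text.strip())
# Level 5 (speculative decoding) layers on top via the server flags shown earlier.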
SGLang: RadixAttention Architecture (2026 Standard)¶
SGLang vs vLLM: Same goal, different KV cache management strategy.
vLLM PagedAttention:
┌──────────────────────────────────────────────────┐
│ KV cache partitioned into fixed-size physical pages │
│ Pages allocated per request, freed on completion │
│ Best for: diverse prompts, simple batch serving │
└──────────────────────────────────────────────────┘
SGLang RadixAttention:
┌──────────────────────────────────────────────────┐
│ KV cache stored as a RADIX TREE (prefix trie) │
│ Common prefix → shared nodes across requests │
│ Automatic prefix matching at every request │
│ │
│ Request A: "System: You are a coder. User: Fix X" │
│ Request B: "System: You are a coder. User: Fix Y" │
│ └── Shared prefix cache ──┘ │
│ Cache hit on system prompt + instruction preamble │
│ Only compute the unique suffix (X vs Y) │
└──────────────────────────────────────────────────┘
When to use SGLang over vLLM: prefix cache hit rate > 30% (RAG systems, multi-turn chat, agentic workloads with shared tool schemas). For diverse prompts with < 10% cache hits, vLLM's wider ecosystem often wins.
| Workload | vLLM | SGLang | Reason |
|---|---|---|---|
| Simple batch Q&A (diverse prompts) | ✅ | Good | Both work; vLLM has wider ecosystem |
| RAG pipeline (shared system prompt) | ⚠️ Manual | ✅ | RadixAttention auto-shares prefix context |
| Multi-turn chat (shared history) | ⚠️ | ✅ | Radix tree naturally stores conversation cache |
| Agentic loops (shared tool schemas) | ⚠️ | ✅ | High-repetition prefix = massive cache hits |
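To see why the prefix sharing matters, the sketch below fires several requests that share one long system prompt at a running SGLang server (launch command in the Code section below); after the first request, the shared prefix is a radix-tree cache hit. The endpoint, port, and model name are assumptions that depend on how the server was started.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:30000/v1", api_key="none")
system = "You are a senior Python engineer. Follow the team style guide: " + "..." * 200  # long shared prefix
for task in ["Fix the off-by-one bug in pagination.", "Add retries to the HTTP client."]:
    r = client.chat.completions.create(
        model="google/gemma-2-2b-it",  # whatever model the server was launched with
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": task},
        ],
    )
    print(r.choices[0].message.content[:80])
# Only the unique user suffix is prefilled on the second request; the system
# prompt's KV entries are reused from the radix tree.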
Prefill/Decode (P/D) Disaggregation¶
TRADITIONAL: One GPU handles BOTH phases
┌──────────────────────────────────────────────┐
│ GPU Node │
│ Prefill (compute-bound) ► Decode (memory-bound) │
│ Decode blocks new prefill requests! │
└──────────────────────────────────────────────┘
P/D DISAGGREGATION: Specialized node clusters
┌──────────────────────────────────────────────┐
│ PREFILL NODES KV xfer DECODE NODES │
│ ┌──────────┐ │ ┌──────────┐ │
│ │High-FLOP │────────►│ │High-BW │ │
│ │GPUs(H100)│ │ │GPUs(H200) │ │
│ └──────────┘ │ └──────────┘ │
│ Maximize FLOPS cache Maximize BW │
└──────────────────────────────────────────────┘
Result: 2-4× higher cluster throughput at same cost.
Implementations (2026): DistServe (OSDI 2024), vLLM KVTransfer API, NVIDIA Dynamo serving stack.
Speculative Decoding — 2026 State of the Art¶
| Method | Key Innovation | Speedup |
|---|---|---|
| Standard speculative | Small draft + large verify | 2-3× |
| EAGLE-3 (2025) | Draft from target's intermediate layers; no separate model | 3-4× |
| P-EAGLE | EAGLE + P/D disaggregation | Up to 5× |
| TurboSpec | Dynamic disabling when acceptance rate drops | Avoids negative speedup |
| SPECTRA (ACL 2025) | Training-free; any draft + any target | 2-2.5× |
EAGLE-3 is the 2026 default: only a small draft head (no separate model to maintain), minimal memory overhead, 3-4× real-world speedup.
◆ Formulas & Equations¶
| Name | Formula/Concept | Use |
|---|---|---|
| Memory (model weights) | $$\text{Memory} = \text{Params} \times \text{Bytes per param}$$ | FP16: 2 bytes, INT4: 0.5 bytes |
| KV cache size | $$\text{KV} = 2 \times L \times H \times D \times S \times B \times \text{precision}$$ | L=layers, H=KV heads (attention heads if no GQA), D=head dim, S=seq_len, B=batch |
| Throughput | $$\text{tokens/second} = \frac{\text{batch_size}}{\text{latency_per_token}}$$ | What you optimize for at scale |
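A quick worked example of the first two formulas for a LLaMA-70B-style configuration (the 80-layer / 8-KV-head / head-dim-128 numbers are assumptions; check the model's config.json for real values):
# Worked example: weight memory + KV cache memory at FP16
params = 70e9
layers, kv_heads, head_dim = 80, 8, 128      # LLaMA-70B-style GQA config (assumed)
seq_len, batch = 4096, 32
bytes_fp16 = 2
weights_gb = params * bytes_fp16 / 1e9
kv_gb = 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_fp16 / 1e9  # 2 = K and V
total_gb = (weights_gb + kv_gb) * 1.2        # ~20% overhead for activations/fragmentation
print(f"weights: {weights_gb:.0f} GB, KV cache: {kv_gb:.0f} GB, total ≈ {total_gb:.0f} GB")
# -> weights: 140 GB, KV cache: ~43 GB, matching the figures quoted earlier in this page.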
◆ Quick Reference¶
OPTIMIZATION PRIORITY ORDER (start here):
1. Right-size the model (don't use 70B if 8B works)
2. Quantize (INT4/INT8 via AWQ/GPTQ)
3. Use vLLM or TGI (not naive inference)
4. Enable KV cache optimizations
5. Add speculative decoding (if latency-critical)
6. Scale with tensor parallelism (multi-GPU)
MEMORY QUICK CALC:
Model params × bytes per param = weight memory
+ KV cache per request × batch size = cache memory
Total GPU memory needed = weights + cache + overhead (~20%)
LATENCY TARGETS (typical):
Interactive chat: < 100ms time-to-first-token
Streaming: 30-50 tokens/sec generation
Batch processing: Maximize throughput, latency less critical
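A small sketch for checking those targets against any OpenAI-compatible server (vLLM, SGLang, Ollama); the URL and model name are placeholders for whatever you run locally:
# Measure time-to-first-token and streaming rate
import time
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
t0, t_first, n_chunks = time.perf_counter(), None, 0
stream = client.chat.completions.create(
    model="google/gemma-2-2b-it",
    messages=[{"role": "user", "content": "Explain PagedAttention in three sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if t_first is None:
            t_first = time.perf_counter() - t0   # time-to-first-token
        n_chunks += 1                            # roughly one token per chunk
elapsed = time.perf_counter() - t0
print(f"TTFT: {t_first*1e3:.0f} ms, streaming rate: {n_chunks/(elapsed - t_first):.1f} tok/s")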
○ Gotchas & Common Mistakes¶
- ⚠️ Quantization isn't free: INT4 CAN degrade quality for complex reasoning. Always benchmark YOUR use case.
- ⚠️ KV cache OOM: Long contexts + large batches = KV cache eats all GPU memory. Monitor and limit.
- ⚠️ Speculative decoding overhead: If the draft model is too slow or inaccurate, speedup disappears. Draft model must be much smaller AND accurate.
- ⚠️ vLLM ≠ magic: You still need to tune batch sizes, GPU memory allocation, and scheduling for your workload.
- ⚠️ Latency vs throughput tradeoff: Optimizing for one often hurts the other. Know which matters for your use case.
- ⚠️ SGLang RadixAttention thrash: If prompts are highly diverse, the radix tree evicts cache aggressively and you lose the benefit. Profile cache hit rate before committing.
- ⚠️ P/D disaggregation complexity: Adds network hop for KV cache transfer. Requires high-bandwidth interconnect (NVLink, InfiniBand) to avoid bottleneck.
★ Code & Implementation¶
Start vLLM Server (OpenAI-Compatible)¶
# pip install vllm>=0.8
# ⚠️ Last tested: 2026-04 | Requires: vllm>=0.8, CUDA GPU
python -m vllm.entrypoints.openai.api_server \
--model google/gemma-2-2b-it \
--dtype bfloat16 \
--max-model-len 4096 \
--gpu-memory-utilization 0.85
# Test via OpenAI client:
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
# r = client.chat.completions.create(model="google/gemma-2-2b-it",
# messages=[{"role": "user", "content": "Explain KV caching."}])
# print(r.choices[0].message.content)
Start SGLang Server (RadixAttention — Best for RAG/Agentic)¶
# pip install sglang[all]>=0.4.0
# ⚠️ Last tested: 2026-04 | Requires: sglang>=0.4, CUDA GPU
python -m sglang.launch_server \
--model-path google/gemma-2-2b-it \
--port 30000 \
--mem-fraction-static 0.85
# RadixAttention prefix caching is automatic — no extra config needed
# High prefix cache hit rate (RAG, agents) → SGLang outperforms vLLM
Load Model with 4-bit Quantization¶
# pip install transformers>=4.40 bitsandbytes>=0.43
# ⚠️ Last tested: 2026-04 | Requires: transformers>=4.40, bitsandbytes>=0.43, CUDA GPU
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_quant_type="nf4", # NF4 = best quality for 4-bit
bnb_4bit_use_double_quant=True, # Double quant reduces metadata overhead
)
model = AutoModelForCausalLM.from_pretrained(
"google/gemma-2-2b-it",
quantization_config=bnb_config,
device_map="auto",
attn_implementation="flash_attention_2",  # needs the flash-attn package; remove this line if it's not installed
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
mem_gb = torch.cuda.memory_allocated() / (1024**3)
print(f"Model loaded in {mem_gb:.1f} GB (4-bit quantized)")
inputs = tokenizer("Explain quantization in one sentence:", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
# LLaMA-70B: 140 GB in FP16 -> ~35 GB with 4-bit (fits on a single A100 80GB!)
○ Interview Angles¶
- Q: How does quantization make LLMs run on consumer hardware?
- A: By representing model weights in fewer bits (INT4 = 4 bits vs FP16 = 16 bits), memory drops 4x. LLaMA 70B goes from 140GB (needs 2× A100 80GB) to ~35GB (fits on a single 40-48 GB GPU). Modern quantization methods (AWQ, GPTQ) preserve quality by protecting important weights and using calibration data.
- Q: Explain speculative decoding.
- A: A small draft model rapidly generates N candidate tokens. The large target model verifies all N in a single parallel forward pass (since verification is parallelizable). Accepted tokens are kept, rejected ones trigger regeneration. Provably lossless (same distribution as the target model) with 2-3x speedup.
- Q: What is PagedAttention and why does vLLM use it?
- A: Like OS virtual memory paging. KV cache is stored in non-contiguous memory blocks ("pages") instead of one contiguous block. This eliminates fragmentation, allows dynamic memory allocation, and enables sharing cache between requests with common prefixes. Result: 2-4x higher throughput.
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | Transformers, LLMs Overview |
| Leads to | Production LLM deployment, Cost optimization, Edge AI |
| Compare with | Training optimization (different phase), Model compression (overlapping) |
| Cross-domain | Computer architecture (memory hierarchy), OS (paging), Compiler optimization |
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Quantization quality cliff | INT4 model produces gibberish on certain inputs | Weight outliers in specific layers | Per-channel quantization (GPTQ), SmoothQuant, mixed precision |
| Speculative decoding mismatch | Target model rejects >50% of drafted tokens; no speedup | Draft model too different from target | Tune acceptance threshold; try EAGLE-3 (no separate draft model) |
| KV-cache OOM at batch | Works for 1 request, OOM at batch size 8 | KV-cache scales linearly with batch size | Paged attention (vLLM), GQA/MQA, KV-cache quantization |
| Continuous batching stalls | Throughput plateaus despite available GPU | Short sequences blocking slots for long ones | Preemptive scheduling, iteration-level batching |
| P/D disaggregation KV transfer latency | Low throughput despite specialized nodes | KV cache transfer between prefill/decode nodes bottlenecks | High-bandwidth interconnect (NVLink, InfiniBand); compress KV before transfer |
| RadixAttention cache thrash | High eviction rate; low cache hit ratio | Extremely diverse prompts exceed cache budget | Increase SGLang cache size; or switch to vLLM for truly diverse workloads |
◆ Hands-On Exercises¶
Exercise 1: Benchmark Quantization Levels¶
Goal: Compare FP16, INT8, INT4 on latency, throughput, and quality
Time: 30 minutes
Steps:
1. Load the same model at FP16, INT8 (bitsandbytes), and INT4 (GPTQ)
2. Run 50 inference samples at each precision
3. Measure tokens/second, memory usage, and output quality (BLEU or LLM-judge)
4. Plot the Pareto frontier of quality vs speed
Expected Output: Quality/speed tradeoff chart across precisions
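A starter skeleton for this exercise, as a sketch under the same assumptions as the code section above (gemma-2-2b-it as the example model, bitsandbytes for INT8/NF4); extend it with a GPTQ INT4 checkpoint and a quality metric to complete the Pareto plot.
# Benchmark the same model at several precisions: tokens/sec + peak memory
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
NAME = "google/gemma-2-2b-it"
PROMPTS = ["Summarize the plot of Hamlet.", "Explain quantization."] * 25  # 50 samples
def bench(quant_config=None, dtype=torch.bfloat16):
    tok = AutoTokenizer.from_pretrained(NAME)
    model = AutoModelForCausalLM.from_pretrained(
        NAME, torch_dtype=dtype, quantization_config=quant_config, device_map="auto")
    torch.cuda.reset_peak_memory_stats()
    n_tokens, t0 = 0, time.perf_counter()
    for p in PROMPTS:
        ids = tok(p, return_tensors="pt").to(model.device)
        out = model.generate(**ids, max_new_tokens=64, do_sample=False)
        n_tokens += out.shape[1] - ids.input_ids.shape[1]
    secs = time.perf_counter() - t0
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    del model
    torch.cuda.empty_cache()
    return n_tokens / secs, peak_gb
for label, cfg in [("bf16", None),
                   ("int8", BitsAndBytesConfig(load_in_8bit=True)),
                   ("nf4", BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4"))]:
    tps, peak = bench(cfg)
    print(f"{label}: {tps:.1f} tok/s, peak {peak:.1f} GB")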
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 📄 Paper | Dao, "FlashAttention-2" (2023) | 2× faster attention — essential for production serving |
| 📄 Paper | Leviathan et al. "Speculative Decoding" (2022) | Accelerate decoding without quality loss |
| 📄 Paper | Zheng et al. "SGLang" (2024) | RadixAttention and efficient LLM serving |
| 📄 Paper | Zhong et al. "DistServe" (2024) | Prefill/Decode disaggregation architecture |
| 🔧 Hands-on | vLLM Documentation | Production inference optimization in practice |
| 🔧 Hands-on | SGLang Documentation | RadixAttention and efficient serving |
| 📘 Book | "Efficient Deep Learning" by Menghani (2024) | Comprehensive treatment of inference optimization techniques |
★ Sources¶
- Dettmers et al., "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale" (2022)
- Lin et al., "AWQ: Activation-aware Weight Quantization" (2023)
- Leviathan et al., "Fast Inference from Transformers via Speculative Decoding" (2023)
- Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention" (vLLM, 2023)
- Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (2022) — https://arxiv.org/abs/2205.14135
- Zheng et al., "SGLang: Efficient Execution of Structured Language Model Programs" (2024) — https://arxiv.org/abs/2312.07104
- Zhong et al., "DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving" (OSDI 2024) — https://arxiv.org/abs/2401.09670
- EAGLE-3: https://arxiv.org/abs/2503.01840
- vLLM documentation — https://docs.vllm.ai
- SGLang documentation — https://sgl-project.github.io