Inference Optimization¶
✨ Bit: Training a frontier LLM costs $100M+. Running it costs... well, also a LOT. Inference optimization is the difference between "cool demo" and "sustainable business." This is the deep tech that companies actually pay for.
★ TL;DR¶
- What: Techniques to make LLM inference faster, cheaper, and more memory-efficient without (significantly) hurting quality
- Why: Inference is where the money is spent (90%+ of LLM compute cost in production). This is THE skill for deep tech roles.
- Key point: Quantization (smaller numbers) + KV caching (don't recompute) + Speculative decoding (predict + verify) = orders of magnitude improvement.
★ Overview¶
Definition¶
Inference optimization covers all techniques that reduce the latency, memory, cost, or compute required to generate outputs from a trained LLM. Unlike training optimization (done once), inference optimization impacts every single request forever.
Scope¶
Covers: Quantization, KV cache, speculative decoding, batching, and architectural optimizations. For serving infrastructure, see Model Serving for LLM Applications and GenAI Tools & Infrastructure. For hardware foundations, see GPU & CUDA Programming for AI Engineers. For scaled serving topologies, see Distributed Inference & Serving Architecture.
Significance¶
- At scale, inference cost > training cost (by 3-10x)
- The reason you can run LLaMA 70B on a single GPU (quantization)
- The reason vLLM serves 10x more requests than naive serving (PagedAttention)
- Deep tech roles require this knowledge: not everyone who uses LLMs understands HOW to make them fast
Prerequisites¶
- Transformers — attention mechanism and KV computation
- LLMs Overview — how generation works (autoregressive)
- Basic GPU/memory concepts
★ Deep Dive¶
Why Inference Is Slow¶
AUTOREGRESSIVE GENERATION IS SEQUENTIAL:
"The" → "capital" → "of" → "France" → "is" → "Paris" → "."
Each token requires a FULL forward pass through the model.
GPT-5.4 (frontier scale): Each forward pass = massive computation.
100-token response = 100 sequential forward passes.
TWO PHASES:
┌──────────────────────────────────────────────────────┐
│ PREFILL (process input) │
│ All input tokens processed in parallel. │
│ Compute-bound (lots of matrix math). │
│ One-time cost per request. │
├──────────────────────────────────────────────────────┤
│ DECODE (generate output) │
│ One token at a time, sequentially. │
│ Memory-bound (loading model weights from GPU RAM). │
│ Repeated for every output token. │
│ THIS IS THE BOTTLENECK. │
└──────────────────────────────────────────────────────┘
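A rough way to see the two phases on real hardware (a minimal sketch; assumes a CUDA GPU, and uses google/gemma-2-2b-it purely as a small example model):
# Measure time-to-first-token (≈ prefill) vs per-token decode cost
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
name = "google/gemma-2-2b-it"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16, device_map="auto")
inputs = tok("Explain why decoding is memory-bound. " * 20, return_tensors="pt").to(model.device)
def timed_generate(n_new):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    model.generate(**inputs, max_new_tokens=n_new, do_sample=False)
    torch.cuda.synchronize()
    return time.perf_counter() - t0
timed_generate(1)                # warm-up (kernel compilation, caches)
t_first = timed_generate(1)      # ≈ prefill cost + one decode step (time-to-first-token)
t_many = timed_generate(101)     # prefill + 101 sequential decode steps
print(f"prefill (TTFT): {t_first*1e3:.0f} ms for {inputs.input_ids.shape[1]} input tokens")
print(f"decode: {(t_many - t_first)/100*1e3:.1f} ms per output token (sequential)")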
The Big Three Techniques¶
1. Quantization (Smaller Numbers = Less Memory)¶
CONCEPT:
FP32: 11000001 01001000 00000000 00000000 (32 bits per number)
FP16: 11000010 01001000 (16 bits — half the memory!)
INT8: 11001010 (8 bits — quarter the memory!)
INT4: 1100 (4 bits — 1/8 the memory!)
MEMORY IMPACT (LLaMA 70B):
FP32: 280 GB (4× A100 80GB)
FP16: 140 GB (2× A100 80GB)
INT8: 70 GB (1× A100 80GB)
INT4: 35 GB (1× A100 40GB or a 48 GB workstation GPU!) ← This is why quantization matters
| Method | Type | How It Works | Quality Loss |
|---|---|---|---|
| GPTQ | PTQ (post-training) | Layer-by-layer quantization using calibration data | Low (< 1%) |
| AWQ | PTQ | Protects "salient" weights from quantization | Very low |
| GGUF | PTQ | CPU-friendly format used by llama.cpp | Varies by bits |
| QLoRA | QAT (quantize + train) | 4-bit base + LoRA adapters in 16-bit | Minimal |
| FP8 | Native hardware | Native support on Hopper-class and newer GPUs (H100, Blackwell, Rubin) | Very low |
QUICK DECISION:
Running locally (consumer GPU)? → GGUF Q4 via Ollama/llama.cpp
Production serving on GPU? → AWQ or GPTQ via vLLM
Fine-tuning on limited GPU? → QLoRA (4-bit base + 16-bit LoRA)
Newest enterprise GPU (Blackwell)? → Native FP8
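To make the "smaller numbers" idea concrete, here is a toy symmetric absmax INT8 quantizer for a single weight tensor. It is a sketch of the principle only, not a production method like GPTQ/AWQ (which use calibration data and per-group scales):
# Toy INT8 quantization of one weight matrix
import torch
w = torch.randn(4096, 4096)                                           # FP32 weights: 4 bytes each
scale = w.abs().max() / 127.0                                         # map the largest magnitude to the INT8 range
w_int8 = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)   # 1 byte each
w_restored = w_int8.to(torch.float32) * scale                         # dequantize at compute time
print(f"FP32: {w.numel() * 4 / 1e6:.0f} MB -> INT8: {w_int8.numel() / 1e6:.0f} MB")
print(f"mean abs error: {(w - w_restored).abs().mean().item():.5f}")
# Real methods (GPTQ/AWQ) keep error low even at 4 bits by using per-group
# scales and calibration data to protect the weights that matter most.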
2. KV Cache (Don't Recompute Past Tokens)¶
WITHOUT KV CACHE:
Generate "Paris": Compute attention for ["The", "capital", "of", "France", "is"]
Generate ".": Compute attention for ["The", "capital", "of", "France", "is", "Paris"]
↑ Recomputed everything again! Wasteful.
WITH KV CACHE:
Generate "Paris": Compute K,V for all tokens → STORE in cache
Generate ".": Reuse cached K,V → Only compute for new token "Paris"
↑ 100x faster for long sequences!
PROBLEM: KV cache grows with sequence length × batch size:
LLaMA 70B, 4096 context, batch=32:
KV cache = ~40 GB of GPU memory just for the cache!
| KV Cache Technique | What It Does | Impact |
|---|---|---|
| PagedAttention (vLLM) | Paging like OS memory management | 2-4x more throughput |
| KV Cache Quantization | Compress cache to FP8/INT4 | 50-75% less cache memory |
| Prefix Caching | Share cache for common prefixes | Fewer recomputations |
| Sliding Window | Only cache recent N tokens | Bounded memory (Mistral) |
| Token Pruning/Eviction | Remove less important cached tokens | More capacity per request |
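A quick way to feel what the cache buys you is to generate with caching disabled (a minimal sketch; assumes a CUDA GPU and uses google/gemma-2-2b-it purely as an example model):
# Same generation with and without the KV cache
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
name = "google/gemma-2-2b-it"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16, device_map="auto")
inputs = tok("The capital of France is", return_tensors="pt").to(model.device)
for use_cache in (True, False):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    model.generate(**inputs, max_new_tokens=256, do_sample=False, use_cache=use_cache)
    torch.cuda.synchronize()
    print(f"use_cache={use_cache}: {time.perf_counter() - t0:.2f}s for 256 tokens")
# With the cache off, every step re-encodes the whole prefix (cost grows with
# output length); with the cache on, each step only computes K/V for the newest token.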
3. Speculative Decoding (Predict + Verify in Parallel)¶
NORMAL DECODING (slow):
Big Model generates: T1 → T2 → T3 → T4 → T5
Time: 5 sequential forward passes through the BIG model
SPECULATIVE DECODING (fast):
Step 1: Small draft model rapidly generates 5 candidate tokens:
T1, T2, T3, T4, T5 (5 fast forward passes)
Step 2: Big model verifies ALL 5 in ONE parallel forward pass:
"T1 ✓, T2 ✓, T3 ✓, T4 ✗ (wrong), T5 —"
Step 3: Accept T1, T2, T3. Regenerate from T4.
Result: 3 tokens accepted from 1 big-model pass instead of 3 sequential passes (the verify pass also yields the target's own token at position 4).
Speedup: 2-3x with ZERO quality loss (mathematically proven).
┌──────────┐ ┌──────────┐ ┌──────────────┐
│ Draft │───►│ Target │───►│ Accept/Reject│
│ Model │ │ Model │ │ verified │
│ (fast) │ │ (accurate)│ │ tokens │
└──────────┘ └──────────┘ └──────────────┘
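The accept/verify loop fits in a few lines. Below is a toy greedy variant, a sketch only (production systems use rejection sampling to exactly match the target's sampling distribution); gpt2 drafts for gpt2-large purely because they share a tokenizer and download quickly:
# Toy greedy speculative decoding: small drafter, one verify pass per step
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("gpt2")
draft = AutoModelForCausalLM.from_pretrained("gpt2").eval()           # small, fast drafter
target = AutoModelForCausalLM.from_pretrained("gpt2-large").eval()    # large verifier
ids = tok("The capital of France is", return_tensors="pt").input_ids
k = 5  # draft tokens per step
with torch.no_grad():
    for _ in range(4):  # a few speculative steps
        # 1) DRAFT: small model proposes k tokens (greedy)
        drafted = draft.generate(ids, max_new_tokens=k, do_sample=False,
                                 pad_token_id=tok.eos_token_id)
        proposals = drafted[0, ids.shape[1]:]
        # 2) VERIFY: ONE target forward pass scores every proposed position
        logits = target(drafted).logits
        preds = logits[0, ids.shape[1] - 1 : -1].argmax(-1)  # target's greedy choice per position
        # 3) ACCEPT until the first disagreement, then take the target's own token,
        #    so every step commits at least one target-quality token
        n = 0
        while n < k and proposals[n] == preds[n]:
            n += 1
        bonus = preds[n] if n < k else logits[0, -1].argmax()
        new_ids = torch.cat([proposals[:n], bonus.view(1)])
        ids = torch.cat([ids, new_ids.unsqueeze(0)], dim=1)
print(tok.decode(ids[0]))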
Speculative Decoding Techniques (2025-2026):
| Technique | Method | Training Required | Typical Speedup | Deployment Maturity |
|---|---|---|---|---|
| EAGLE-3 | Draft head (2-5% of target model params), feature fusion, tree-based candidate verification | Yes (lightweight head training) | 2-3x | Production (vLLM, SGLang) |
| Medusa | Multiple prediction heads on target model, each predicts k tokens ahead | Yes (head fine-tuning) | 1.5-2.5x | Production |
| SPECTRA | Training-free, uses n-gram + small model ensemble for drafting | No | 1.5-2x | Research → Production |
| Self-Speculative | Model drafts from its own early layers (early exit) | No | 1.3-1.8x | Experimental |
ENABLING SPECULATIVE DECODING IN VLLM:
# Start vLLM with speculative decoding enabled
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.2-70B-Instruct \
--speculative-model meta-llama/Llama-3.2-8B-Instruct \
--num-speculative-tokens 5 \
--use-v2-block-manager
# Key flags:
# --speculative-model Draft model (smaller, same tokenizer)
# --num-speculative-tokens How many tokens to draft per step (3-7)
# --speculative-draft-tensor-parallel-size TP for draft model
# Newer vLLM releases consolidate these into a single --speculative-config JSON flag; check --help for your version.
# SGLang equivalent (flag names vary by SGLang version; check --help):
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.2-70B-Instruct \
--speculative-algorithm EAGLE \
--speculative-draft-model-path eagle-head-weights/
Edge & On-Device Inference¶
Running LLMs on consumer hardware, mobile devices, and edge GPUs.
| Stack | Hardware | Best For | Key Feature |
|---|---|---|---|
| Ollama | macOS/Linux/Windows | Local LLM experimentation | One-command setup, model registry |
| MLX | Apple Silicon (M1-M4) | macOS-native inference | UMA advantage, Metal acceleration |
| llama.cpp / GGUF | CPU + optional GPU | Universal local inference | Pure C++, no Python dependencies |
| ExecuTorch | iOS / Android | Mobile deployment | PyTorch mobile runtime |
EDGE MODEL SELECTION (April 2026):
| Device Memory | Recommended Models | Quantization |
|---|---|---|
| 4 GB | Gemma 4 E2B (2B), Phi-4-mini | Q4_K_M |
| 8 GB | Gemma 4 E4B (4B), LLaMA 3.2 3B | Q4_K_M / Q5_K_M |
| 16 GB | Gemma 4 26B MoE, Mistral 7B | Q4_K_M |
| 32 GB | LLaMA 3.2 8B, Gemma 4 31B | Q5_K_M / Q6_K |
| 64 GB+ | LLaMA 3.2 70B | Q4_K_M |
APPLE SILICON UMA ADVANTAGE:
CPU and GPU share the same memory pool (Unified Memory Architecture)
No PCIe transfer bottleneck → faster KV cache access
M4 Max (128GB) can run 70B models at Q4 comfortably
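A typical local workflow, sketched with Ollama's OpenAI-compatible endpoint; it assumes Ollama is installed, a model has been pulled (the llama3.2:3b tag is just an example), and the default port 11434:
# Prereq (terminal): ollama pull llama3.2:3b   # pick a model/quant that fits your RAM
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored locally
resp = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Why does Q4_K_M quantization let a 3B model run on a laptop?"}],
)
print(resp.choices[0].message.content)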
Other Key Techniques¶
| Technique | What | Impact |
|---|---|---|
| Continuous Batching | Dynamically add/remove requests from a batch | Better GPU utilization |
| Tensor Parallelism | Split model across multiple GPUs | Run models too large for 1 GPU |
| Pipeline Parallelism | Different layers on different GPUs | Reduce per-GPU memory |
| Flash Attention | Tiled attention computation | 2-4x faster attention |
| Knowledge Distillation | Train smaller model to mimic larger | Smaller, faster model |
| Pruning | Remove unimportant weights | Smaller model, some quality loss |
| MoE (Mixture of Experts) | Only activate subset of params per token | More capacity, less compute |
The Optimization Stack (Layer Them)¶
LEVEL 1: Architecture (MoE, GQA, Flash Attention) ← Built into model
LEVEL 2: Quantization (INT4/INT8/FP8) ← Compress model
LEVEL 3: Serving Engine (vLLM, TGI, SGLang) ← Efficient serving
LEVEL 4: KV Cache Optimization (PagedAttention, compression)← Memory management
LEVEL 5: Speculative Decoding (draft + verify) ← Speed up generation
LEVEL 6: Batching (continuous/dynamic) ← Maximize throughput
LEVEL 7: Hardware (Blackwell GPUs, custom ASICs) ← Raw performance
Stack them: 4-bit quant + vLLM + speculative decoding
= 10x+ cheaper than naive FP16 serving
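A minimal sketch of stacking levels 2-4 and 6 in a few lines via vLLM's offline API: load a pre-quantized AWQ checkpoint and let the engine supply PagedAttention and continuous batching. The checkpoint name below is only an example of an AWQ repo; substitute any AWQ-quantized model with a supported architecture.
from vllm import LLM, SamplingParams
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",  # example pre-quantized AWQ checkpoint
    quantization="awq",               # level 2: 4-bit weights
    gpu_memory_utilization=0.90,      # leave headroom for the KV cache
    max_model_len=4096,
)
outputs = llm.generate(
    ["Explain KV caching in one sentence.", "What is speculative decoding?"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
for out in outputs:
    print(out.outputs[0].text.strip())
# Level 5 (speculative decoding) layers on top via the server flags shown earlier.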
SGLang: RadixAttention Architecture (2026 Standard)¶
SGLang vs vLLM: Same goal, different KV cache management strategy.
vLLM PagedAttention:
┌──────────────────────────────────────────────────┐
│ KV cache partitioned into fixed-size physical pages │
│ Pages allocated per request, freed on completion │
│ Best for: diverse prompts, simple batch serving │
└──────────────────────────────────────────────────┘
SGLang RadixAttention:
┌──────────────────────────────────────────────────┐
│ KV cache stored as a RADIX TREE (prefix trie) │
│ Common prefix → shared nodes across requests │
│ Automatic prefix matching at every request │
│ │
│ Request A: "System: You are a coder. User: Fix X" │
│ Request B: "System: You are a coder. User: Fix Y" │
│ └── Shared prefix cache ──┘ │
│ Cache hit on system prompt + instruction preamble │
│ Only compute the unique suffix (X vs Y) │
└──────────────────────────────────────────────────┘
When to use SGLang over vLLM: prefix cache hit rate > 30% (RAG systems, multi-turn chat, agentic workloads with shared tool schemas). For diverse prompts with < 10% cache hits, vLLM's wider ecosystem often wins.
| Workload | vLLM | SGLang | Reason |
|---|---|---|---|
| Simple batch Q&A (diverse prompts) | ✅ | Good | Both work; vLLM has wider ecosystem |
| RAG pipeline (shared system prompt) | ⚠️ Manual | ✅ | RadixAttention auto-shares prefix context |
| Multi-turn chat (shared history) | ⚠️ | ✅ | Radix tree naturally stores conversation cache |
| Agentic loops (shared tool schemas) | ⚠️ | ✅ | High-repetition prefix = massive cache hits |
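To see why the prefix sharing matters, the sketch below fires several requests that share one long system prompt at a running SGLang server (launch command in the Code section below); after the first request, the shared prefix is a radix-tree cache hit. The endpoint, port, and model name are assumptions that depend on how the server was started.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:30000/v1", api_key="none")
system = "You are a senior Python engineer. Follow the team style guide: " + "..." * 200  # long shared prefix
for task in ["Fix the off-by-one bug in pagination.", "Add retries to the HTTP client."]:
    r = client.chat.completions.create(
        model="google/gemma-2-2b-it",  # whatever model the server was launched with
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": task},
        ],
    )
    print(r.choices[0].message.content[:80])
# Only the unique user suffix is prefilled on the second request; the system
# prompt's KV entries are reused from the radix tree.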
Prefill/Decode (P/D) Disaggregation¶
TRADITIONAL: One GPU handles BOTH phases
┌──────────────────────────────────────────────┐
│ GPU Node │
│ Prefill (compute-bound) ► Decode (memory-bound) │
│ Decode blocks new prefill requests! │
└──────────────────────────────────────────────┘
P/D DISAGGREGATION: Specialized node clusters
┌──────────────────────────────────────────────┐
│ PREFILL NODES KV xfer DECODE NODES │
│ ┌──────────┐ │ ┌──────────┐ │
│ │High-FLOP │────────►│ │High-BW │ │
│ │GPUs(H100)│ │ │GPUs(H200) │ │
│ └──────────┘ │ └──────────┘ │
│ Maximize FLOPS cache Maximize BW │
└──────────────────────────────────────────────┘
Result: 2-4× higher cluster throughput at same cost.
Implementations (2026): DistServe (OSDI 2024), vLLM KVTransfer API, NVIDIA Dynamo serving stack.
Speculative Decoding — 2026 State of the Art¶
| Method | Key Innovation | Speedup |
|---|---|---|
| Standard speculative | Small draft + large verify | 2-3× |
| EAGLE-3 (2025) | Draft from target's intermediate layers; no separate model | 3-4× |
| P-EAGLE | EAGLE + P/D disaggregation | Up to 5× |
| TurboSpec | Dynamic disabling when acceptance rate drops | Avoids negative speedup |
| SPECTRA (ACL 2025) | Training-free; any draft + any target | 2-2.5× |
EAGLE-3 is the 2026 default: only a small draft head (no separate model to maintain), minimal memory overhead, 3-4× real-world speedup.
◆ Formulas & Equations¶
| Name | Formula/Concept | Use |
|---|---|---|
| Memory (model weights) | $$\text{Memory} = \text{Params} \times \text{Bytes per param}$$ | FP16: 2 bytes, INT4: 0.5 bytes |
| KV cache size | $$\text{KV} = 2 \times L \times H \times D \times S \times B \times \text{precision}$$ | L=layers, H=KV heads (attention heads if no GQA), D=head dim, S=seq_len, B=batch |
| Throughput | $$\text{tokens/second} = \frac{\text{batch_size}}{\text{latency_per_token}}$$ | What you optimize for at scale |
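A quick worked example of the first two formulas for a LLaMA-70B-style configuration (the 80-layer / 8-KV-head / head-dim-128 numbers are assumptions; check the model's config.json for real values):
# Worked example: weight memory + KV cache memory at FP16
params = 70e9
layers, kv_heads, head_dim = 80, 8, 128      # LLaMA-70B-style GQA config (assumed)
seq_len, batch = 4096, 32
bytes_fp16 = 2
weights_gb = params * bytes_fp16 / 1e9
kv_gb = 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_fp16 / 1e9  # 2 = K and V
total_gb = (weights_gb + kv_gb) * 1.2        # ~20% overhead for activations/fragmentation
print(f"weights: {weights_gb:.0f} GB, KV cache: {kv_gb:.0f} GB, total ≈ {total_gb:.0f} GB")
# -> weights: 140 GB, KV cache: ~43 GB, matching the figures quoted earlier in this page.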
◆ Quick Reference¶
OPTIMIZATION PRIORITY ORDER (start here):
1. Right-size the model (don't use 70B if 8B works)
2. Quantize (INT4/INT8 via AWQ/GPTQ)
3. Use vLLM or TGI (not naive inference)
4. Enable KV cache optimizations
5. Add speculative decoding (if latency-critical)
6. Scale with tensor parallelism (multi-GPU)
MEMORY QUICK CALC:
Model params × bytes per param = weight memory
+ KV cache per request × batch size = cache memory
Total GPU memory needed = weights + cache + overhead (~20%)
LATENCY TARGETS (typical):
Interactive chat: < 100ms time-to-first-token
Streaming: 30-50 tokens/sec generation
Batch processing: Maximize throughput, latency less critical
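A small sketch for checking those targets against any OpenAI-compatible server (vLLM, SGLang, Ollama); the URL and model name are placeholders for whatever you run locally:
# Measure time-to-first-token and streaming rate
import time
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
t0, t_first, n_chunks = time.perf_counter(), None, 0
stream = client.chat.completions.create(
    model="google/gemma-2-2b-it",
    messages=[{"role": "user", "content": "Explain PagedAttention in three sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if t_first is None:
            t_first = time.perf_counter() - t0   # time-to-first-token
        n_chunks += 1                            # roughly one token per chunk
elapsed = time.perf_counter() - t0
print(f"TTFT: {t_first*1e3:.0f} ms, streaming rate: {n_chunks/(elapsed - t_first):.1f} tok/s")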
○ Gotchas & Common Mistakes¶
- ⚠️ Quantization isn't free: INT4 CAN degrade quality for complex reasoning. Always benchmark YOUR use case.
- ⚠️ KV cache OOM: Long contexts + large batches = KV cache eats all GPU memory. Monitor and limit.
- ⚠️ Speculative decoding overhead: If the draft model is too slow or inaccurate, speedup disappears. Draft model must be much smaller AND accurate.
- ⚠️ vLLM ≠ magic: You still need to tune batch sizes, GPU memory allocation, and scheduling for your workload.
- ⚠️ Latency vs throughput tradeoff: Optimizing for one often hurts the other. Know which matters for your use case.
- ⚠️ SGLang RadixAttention thrash: If prompts are highly diverse, the radix tree evicts cache aggressively and you lose the benefit. Profile cache hit rate before committing.
- ⚠️ P/D disaggregation complexity: Adds network hop for KV cache transfer. Requires high-bandwidth interconnect (NVLink, InfiniBand) to avoid bottleneck.
★ Code & Implementation¶
Start vLLM Server (OpenAI-Compatible)¶
# pip install vllm>=0.8
# ⚠️ Last tested: 2026-04 | Requires: vllm>=0.8, CUDA GPU
python -m vllm.entrypoints.openai.api_server \
--model google/gemma-2-2b-it \
--dtype bfloat16 \
--max-model-len 4096 \
--gpu-memory-utilization 0.85
# Test via OpenAI client:
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
# r = client.chat.completions.create(model="google/gemma-2-2b-it",
# messages=[{"role": "user", "content": "Explain KV caching."}])
# print(r.choices[0].message.content)
Start SGLang Server (RadixAttention — Best for RAG/Agentic)¶
# pip install sglang[all]>=0.4.0
# ⚠️ Last tested: 2026-04 | Requires: sglang>=0.4, CUDA GPU
python -m sglang.launch_server \
--model-path google/gemma-2-2b-it \
--port 30000 \
--mem-fraction-static 0.85
# RadixAttention prefix caching is automatic — no extra config needed
# High prefix cache hit rate (RAG, agents) → SGLang outperforms vLLM
Load Model with 4-bit Quantization¶
# pip install transformers>=4.40 bitsandbytes>=0.43
# ⚠️ Last tested: 2026-04 | Requires: transformers>=4.40, bitsandbytes>=0.43, CUDA GPU
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_quant_type="nf4", # NF4 = best quality for 4-bit
bnb_4bit_use_double_quant=True, # Double quant reduces metadata overhead
)
model = AutoModelForCausalLM.from_pretrained(
"google/gemma-2-2b-it",
quantization_config=bnb_config,
device_map="auto",
attn_implementation="flash_attention_2",  # needs the flash-attn package; remove this line if it's not installed
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
mem_gb = torch.cuda.memory_allocated() / (1024**3)
print(f"Model loaded in {mem_gb:.1f} GB (4-bit quantized)")
inputs = tokenizer("Explain quantization in one sentence:", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
# LLaMA-70B: 140 GB in FP16 -> ~35 GB with 4-bit (fits on a single A100 80GB!)
○ Interview Angles¶
- Q: How does quantization make LLMs run on consumer hardware?
- A: By representing model weights in fewer bits (INT4 = 4 bits vs FP16 = 16 bits), memory drops 4x. LLaMA 70B goes from 140GB (needs 2× A100 80GB) to ~35GB (fits on a single 40-48 GB GPU). Modern quantization methods (AWQ, GPTQ) preserve quality by protecting important weights and using calibration data.
- Q: Explain speculative decoding.
- A: A small draft model rapidly generates N candidate tokens. The large target model verifies all N in a single parallel forward pass (since verification is parallelizable). Accepted tokens are kept, rejected ones trigger regeneration. Provably lossless (same distribution as the target model) with 2-3x speedup.
- Q: What is PagedAttention and why does vLLM use it?
- A: Like OS virtual memory paging. KV cache is stored in non-contiguous memory blocks ("pages") instead of one contiguous block. This eliminates fragmentation, allows dynamic memory allocation, and enables sharing cache between requests with common prefixes. Result: 2-4x higher throughput.
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | Transformers, LLMs Overview |
| Leads to | Production LLM deployment, Cost optimization, Edge AI |
| Compare with | Training optimization (different phase), Model compression (overlapping) |
| Cross-domain | Computer architecture (memory hierarchy), OS (paging), Compiler optimization |
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Quantization quality cliff | INT4 model produces gibberish on certain inputs | Weight outliers in specific layers | Per-channel quantization (GPTQ), SmoothQuant, mixed precision |
| Speculative decoding mismatch | Target model rejects >50% of drafted tokens; no speedup | Draft model too different from target | Tune acceptance threshold; try EAGLE-3 (no separate draft model) |
| KV-cache OOM at batch | Works for 1 request, OOM at batch size 8 | KV-cache scales linearly with batch size | Paged attention (vLLM), GQA/MQA, KV-cache quantization |
| Continuous batching stalls | Throughput plateaus despite available GPU | Short sequences blocking slots for long ones | Preemptive scheduling, iteration-level batching |
| P/D disaggregation KV transfer latency | Low throughput despite specialized nodes | KV cache transfer between prefill/decode nodes bottlenecks | High-bandwidth interconnect (NVLink, InfiniBand); compress KV before transfer |
| RadixAttention cache thrash | High eviction rate; low cache hit ratio | Extremely diverse prompts exceed cache budget | Increase SGLang cache size; or switch to vLLM for truly diverse workloads |
◆ Hands-On Exercises¶
Exercise 1: Benchmark Quantization Levels¶
Goal: Compare FP16, INT8, INT4 on latency, throughput, and quality
Time: 30 minutes
Steps:
1. Load the same model at FP16, INT8 (bitsandbytes), and INT4 (GPTQ)
2. Run 50 inference samples at each precision
3. Measure tokens/second, memory usage, and output quality (BLEU or LLM-judge)
4. Plot the Pareto frontier of quality vs speed
Expected Output: Quality/speed tradeoff chart across precisions
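A starter skeleton for this exercise, as a sketch under the same assumptions as the code section above (gemma-2-2b-it as the example model, bitsandbytes for INT8/NF4); extend it with a GPTQ INT4 checkpoint and a quality metric to complete the Pareto plot.
# Benchmark the same model at several precisions: tokens/sec + peak memory
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
NAME = "google/gemma-2-2b-it"
PROMPTS = ["Summarize the plot of Hamlet.", "Explain quantization."] * 25  # 50 samples
def bench(quant_config=None, dtype=torch.bfloat16):
    tok = AutoTokenizer.from_pretrained(NAME)
    model = AutoModelForCausalLM.from_pretrained(
        NAME, torch_dtype=dtype, quantization_config=quant_config, device_map="auto")
    torch.cuda.reset_peak_memory_stats()
    n_tokens, t0 = 0, time.perf_counter()
    for p in PROMPTS:
        ids = tok(p, return_tensors="pt").to(model.device)
        out = model.generate(**ids, max_new_tokens=64, do_sample=False)
        n_tokens += out.shape[1] - ids.input_ids.shape[1]
    secs = time.perf_counter() - t0
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    del model
    torch.cuda.empty_cache()
    return n_tokens / secs, peak_gb
for label, cfg in [("bf16", None),
                   ("int8", BitsAndBytesConfig(load_in_8bit=True)),
                   ("nf4", BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4"))]:
    tps, peak = bench(cfg)
    print(f"{label}: {tps:.1f} tok/s, peak {peak:.1f} GB")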
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 📄 Paper | Dao, "FlashAttention-2" (2023) | 2× faster attention — essential for production serving |
| 📄 Paper | Leviathan et al. "Speculative Decoding" (2022) | Accelerate decoding without quality loss |
| 📄 Paper | Zheng et al. "SGLang" (2024) | RadixAttention and efficient LLM serving |
| 📄 Paper | Zhong et al. "DistServe" (2024) | Prefill/Decode disaggregation architecture |
| 🔧 Hands-on | vLLM Documentation | Production inference optimization in practice |
| 🔧 Hands-on | SGLang Documentation | RadixAttention and efficient serving |
| 📘 Book | "Efficient Deep Learning" by Menghani (2024) | Comprehensive treatment of inference optimization techniques |
★ Sources¶
- Dettmers et al., "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale" (2022)
- Lin et al., "AWQ: Activation-aware Weight Quantization" (2023)
- Leviathan et al., "Fast Inference from Transformers via Speculative Decoding" (2023)
- Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention" (vLLM, 2023)
- Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (2022) — https://arxiv.org/abs/2205.14135
- Zheng et al., "SGLang: Efficient Execution of Structured Language Model Programs" (2024) — https://arxiv.org/abs/2312.07104
- Zhong et al., "DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving" (OSDI 2024) — https://arxiv.org/abs/2401.09670
- EAGLE-3: https://arxiv.org/abs/2503.01840
- vLLM documentation — https://docs.vllm.ai
- SGLang documentation — https://sgl-project.github.io