Distributed Inference & Serving Architecture¶
Serving one model call is straightforward. Serving many users, many models, and long contexts at acceptable latency is an architecture problem.
★ TL;DR¶
- What: The system design patterns used to scale inference across multiple machines and accelerators.
- Why: Large-model serving hits limits in memory, concurrency, and cost that one process or one GPU cannot handle cleanly.
- Key point: Distributed inference is about routing, sharding, batching, and locality as much as model execution itself.
★ Overview¶
Definition¶
Distributed inference is the execution of inference workloads across multiple processes, machines, or accelerators while coordinating requests, model state, and performance goals.
Scope¶
This note covers the architecture view of scaled serving, not the internals of any one engine.
Significance¶
- Large production LLM services are distributed by necessity.
- Cost, latency, and availability are inseparable at this layer.
- This topic sits between inference optimization and platform engineering.
Prerequisites¶
- Inference optimization basics
- Model serving fundamentals
- Distributed systems fundamentals
★ Deep Dive¶
Common Distributed Serving Shapes¶
| Pattern | Why Teams Use It |
|---|---|
| Replica scaling | more copies for more traffic |
| Model sharding | fit larger models across multiple GPUs |
| Request routing | send traffic by tenant, model, region, or workload |
| Batch workers | improve hardware utilization |
| Multi-tier routing | cheap path for easy tasks, strong path for hard tasks |
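The multi-tier pattern in the last row comes down to a routing policy. A minimal sketch is below; the model names, the token threshold, and the needs_reasoning flag are illustrative assumptions, not part of any particular serving stack.

# Minimal sketch of multi-tier routing; model names, threshold, and the
# needs_reasoning flag are illustrative assumptions, not a real catalog.
from dataclasses import dataclass

@dataclass
class TierPolicy:
    cheap_model: str = "small-8b"        # hypothetical cheap tier
    strong_model: str = "large-70b"      # hypothetical strong tier
    max_cheap_prompt_tokens: int = 1024  # assumed cutoff for the cheap path

    def pick_tier(self, prompt_tokens: int, needs_reasoning: bool) -> str:
        # Long or hard requests go to the strong tier; everything else stays cheap.
        if needs_reasoning or prompt_tokens > self.max_cheap_prompt_tokens:
            return self.strong_model
        return self.cheap_model

policy = TierPolicy()
print(policy.pick_tier(prompt_tokens=300, needs_reasoning=False))  # -> small-8b
print(policy.pick_tier(prompt_tokens=300, needs_reasoning=True))   # -> large-70b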
Core Architectural Questions¶
- Does the model fit on one GPU, or does it need sharding? (see the memory-fit sketch after this list)
- Are requests interactive or batch?
- How much context and KV-cache growth should the system support?
- How should overload be handled?
- What locality matters: model, cache, tenant, or region?
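The first question in the list above often has a purely arithmetic answer. A back-of-envelope sketch, assuming FP16 weights at 2 bytes per parameter, 80 GB GPUs, and a 30% reserve for KV-cache and activations; all of these figures are assumptions, not measurements.

# Back-of-envelope check: does the model fit on one GPU, or does it need sharding?
# GPU memory size, bytes per parameter, and the reserve fraction are assumptions.
import math

def min_gpus_for_weights(params_b: float, gpu_mem_gb: float = 80.0,
                         bytes_per_param: int = 2, reserve_frac: float = 0.3) -> int:
    """Estimate a tensor-parallel degree from weight memory alone (rough heuristic)."""
    weight_gb = params_b * bytes_per_param       # params in billions -> GB at FP16
    usable_gb = gpu_mem_gb * (1 - reserve_frac)  # leave room for KV-cache and activations
    return max(1, math.ceil(weight_gb / usable_gb))

print(min_gpus_for_weights(params_b=70))   # ~140 GB of FP16 weights -> 3 x 80 GB GPUs minimum
print(min_gpus_for_weights(params_b=8))    # ~16 GB of FP16 weights -> fits on one GPU

Real tensor-parallel degrees are typically rounded up to a power of two, which is why a 70B model is usually served on 4 GPUs, as in the vLLM command later in this note.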
Request Path Example¶
client
-> gateway
-> router
-> queue or scheduler
-> inference workers
-> cache / state tracking
-> streaming response
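The same path can be collapsed into an in-process sketch to make the hand-offs concrete. Everything here is a stand-in: asyncio queues replace network hops, and the worker echoes tokens instead of running a model.

# Minimal in-process sketch of the request path; every component name is illustrative
# and asyncio queues stand in for network hops between real services.
import asyncio

async def worker(name: str, queue: asyncio.Queue) -> None:
    while True:
        prompt, reply = await queue.get()                 # scheduler hands work to a worker
        for token in f"[{name}] echo: {prompt}".split():  # stand-in for model execution
            await reply.put(token)
        await reply.put(None)                             # end-of-stream marker
        queue.task_done()

async def handle_request(prompt: str, queue: asyncio.Queue) -> str:
    reply: asyncio.Queue = asyncio.Queue()
    await queue.put((prompt, reply))                      # gateway/router enqueue the request
    tokens = []
    while (tok := await reply.get()) is not None:         # streaming response back to the client
        tokens.append(tok)
    return " ".join(tokens)

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    workers = [asyncio.create_task(worker(f"w{i}", queue)) for i in range(2)]
    print(await handle_request("hello", queue))
    for w in workers:
        w.cancel()

asyncio.run(main())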
What Makes This Hard¶
| Challenge | Why It Hurts |
|---|---|
| KV-cache locality | moving or rebuilding cache is expensive |
| Uneven request lengths | long requests create tail latency |
| Burst traffic | overload spreads quickly across the cluster |
| Multi-model fleets | capacity planning becomes messy |
| GPU fragmentation | expensive capacity is left stranded and underused |
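The KV-cache and request-length rows are easier to reason about in bytes. A rough per-request estimate is sketched below, assuming the shape of a large GQA model (80 layers, 8 KV heads, head dim 128, FP16); the dimensions are illustrative, not tied to any specific checkpoint.

# Rough per-request KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes.
def kv_cache_gb(seq_len: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """Estimate KV-cache memory for one sequence; model dimensions are illustrative."""
    total_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
    return total_bytes / (1024 ** 3)

for seq_len in (2_000, 32_000, 128_000):
    print(f"{seq_len:>7} tokens -> ~{kv_cache_gb(seq_len):.1f} GB of KV-cache per request")
# Longer contexts multiply cache size, which is why moving or rebuilding the cache dominates
# cost and why a few long requests can crowd out many short ones.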
Important Patterns¶
- separate interactive and batch workloads
- keep hot models warm
- use routing to avoid unnecessary expensive paths
- stream where user experience benefits
- apply backpressure before the cluster collapses (see the sketch below)
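A minimal backpressure sketch, assuming an in-process bounded queue and illustrative status codes; real systems enforce this at the gateway or load balancer.

# Minimal backpressure sketch: reject new work once the queue is full instead of
# letting latency grow without bound. The limit and status codes are illustrative.
import asyncio

MAX_QUEUE_DEPTH = 8   # assumed limit; derive the real value from measured capacity

def admit(queue: asyncio.Queue, prompt: str) -> dict:
    try:
        queue.put_nowait(prompt)          # admit only while there is room
    except asyncio.QueueFull:
        # Shed load early and visibly rather than queuing indefinitely.
        return {"status": 429, "detail": "overloaded, retry later"}
    return {"status": 202, "detail": "queued", "depth": queue.qsize()}

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue(maxsize=MAX_QUEUE_DEPTH)
    for i in range(10):
        print(admit(queue, f"request-{i}"))   # the last two requests are rejected

asyncio.run(main())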
Multi-Region Thinking¶
Teams may distribute inference across regions for:
- lower latency
- resilience
- data-governance boundaries
That introduces trade-offs in cache reuse, model synchronization, and observability.
Design Heuristics¶
- Start with a simpler serving topology than you think you need.
- Avoid mixing all workload types on the same critical path.
- Monitor tail latency, queue depth, and GPU utilization together.
- Treat routing policy as a product and cost lever.
- Plan failure behavior explicitly.
◆ Quick Reference¶
| Problem | First Architecture Move |
|---|---|
| one GPU cannot hold the model | use model sharding or a smaller/quantized model |
| traffic spikes | add queueing, autoscaling, and backpressure |
| long requests hurt everyone | isolate workloads or add length-aware scheduling |
| costs are unstable | add routing tiers and cache strategy |
| regional latency too high | consider multi-region serving |
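The length-aware scheduling row can start as simply as two queues keyed on estimated length. The threshold and queue names below are assumptions for illustration.

# Sketch of length-aware scheduling: keep long requests off the interactive path.
# The token threshold and queue names are assumptions for illustration.
from collections import deque

SHORT_LIMIT_TOKENS = 1_000

short_queue: deque = deque()   # interactive path, latency target
long_queue: deque = deque()    # long/batch path, throughput target

def enqueue(request_id: str, estimated_tokens: int) -> str:
    """Route by estimated total length so one long request cannot stall the short path."""
    queue = short_queue if estimated_tokens <= SHORT_LIMIT_TOKENS else long_queue
    queue.append(request_id)
    return "short" if queue is short_queue else "long"

print(enqueue("chat-1", estimated_tokens=300))       # -> short
print(enqueue("report-7", estimated_tokens=20_000))  # -> long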
○ Gotchas & Common Mistakes¶
- More replicas do not fix memory-fit problems.
- Throughput optimization can damage interactive latency.
- Cache design is often the hidden determinant of real performance.
○ Interview Angles¶
- Q: What is the difference between scaling replicas and sharding a model?
- A: Replica scaling duplicates the full serving stack to handle more requests, while model sharding splits one model across multiple devices because it is too large or too expensive to serve as one unit.
- Q: Why is KV-cache locality important?
- A: Because rebuilding or transferring cache state is costly. Good locality keeps follow-up tokens and conversation turns efficient instead of repeatedly paying for the same context.
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | Inference Optimization, Model Serving for LLM Applications, Distributed Systems Fundamentals for AI |
| Leads to | inference platform engineering, serving optimization |
| Compare with | single-node serving |
| Cross-domain | queueing, scheduling, cluster design |
★ Code & Implementation¶
vLLM Distributed Serving¶
# pip install "vllm>=0.8" openai
# ⚠️ Last tested: 2026-04 | Requires: vllm>=0.8 (server), openai>=1.0 (client example below)
# Launch vLLM with tensor parallelism across 4 GPUs
# Command line:
# python -m vllm.entrypoints.openai.api_server \
# --model meta-llama/Llama-3.1-70B-Instruct \
# --tensor-parallel-size 4 \
# --max-model-len 8192 \
# --gpu-memory-utilization 0.9 \
# --port 8000
# Python client (OpenAI-compatible API)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-70B-Instruct",
messages=[{"role": "user", "content": "Explain KV-cache in 2 sentences."}],
max_tokens=100,
)
print(response.choices[0].message.content)
# Expected: a short explanation of KV-cache, generated by the model sharded across 4 GPUs via tensor parallelism
Multi-GPU Health Monitor¶
# ⚠️ Last tested: 2026-04 | Requires: nvidia-ml-py3 (pynvml), Python 3.10+
import pynvml
from dataclasses import dataclass
pynvml.nvmlInit()
@dataclass
class GPUHealth:
index: int
name: str
temp_c: int
utilization_pct: int
memory_used_gb: float
memory_total_gb: float
power_w: float
healthy: bool
def check_gpu_health(max_temp: int = 85, min_free_gb: float = 2.0) -> list[GPUHealth]:
"""Monitor all GPUs for distributed serving health checks."""
count = pynvml.nvmlDeviceGetCount()
results = []
for i in range(count):
h = pynvml.nvmlDeviceGetHandleByIndex(i)
temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
util = pynvml.nvmlDeviceGetUtilizationRates(h)
mem = pynvml.nvmlDeviceGetMemoryInfo(h)
free_gb = (mem.total - mem.used) / (1024**3)
        name = pynvml.nvmlDeviceGetName(h)
        if isinstance(name, bytes):  # older pynvml builds return bytes, newer return str
            name = name.decode()
        results.append(GPUHealth(
            index=i, name=name, temp_c=temp,
utilization_pct=util.gpu,
memory_used_gb=round(mem.used / (1024**3), 1),
memory_total_gb=round(mem.total / (1024**3), 1),
power_w=round(pynvml.nvmlDeviceGetPowerUsage(h) / 1000, 1),
healthy=(temp < max_temp and free_gb > min_free_gb),
))
return results
# for gpu in check_gpu_health():
# s = "OK" if gpu.healthy else "ALERT"
# print(f"GPU{gpu.index} [{s}]: {gpu.temp_c}C, {gpu.utilization_pct}% util, "
# f"{gpu.memory_used_gb}/{gpu.memory_total_gb}GB")
# Expected output: Per-GPU health status with temperature, utilization, memory
Prefill/Decode Disaggregation Router¶
# ⚠️ Last tested: 2026-04 | Requires: Python 3.10+ (architecture pattern)
# P/D disaggregation: separate GPU pools for prefill (compute-bound) vs decode (memory-bound)
# Reference: DistServe (2024), Splitwise (2024)
from dataclasses import dataclass, field
from enum import Enum
class Phase(Enum):
PREFILL = "prefill"
DECODE = "decode"
@dataclass
class PDRouter:
"""Route LLM requests to specialized GPU pools by phase."""
prefill_pool: list[str] = field(default_factory=lambda: ["gpu-0", "gpu-1"])
decode_pool: list[str] = field(default_factory=lambda: ["gpu-2", "gpu-3", "gpu-4", "gpu-5"])
_pf_idx: int = 0
_dc_idx: int = 0
def route(self, phase: Phase) -> str:
if phase == Phase.PREFILL:
w = self.prefill_pool[self._pf_idx % len(self.prefill_pool)]
self._pf_idx += 1
return w
w = self.decode_pool[self._dc_idx % len(self.decode_pool)]
self._dc_idx += 1
return w
def serve(self, input_tokens: int, output_tokens: int) -> dict:
pf = self.route(Phase.PREFILL)
dc = self.route(Phase.DECODE)
        # Illustrative cost model: prefill is compute-bound and cheap per prompt token,
        # while decode is memory-bound and pays a per-step cost for every generated token.
        pf_ms = input_tokens * 0.01   # assumed ~0.01 ms per prompt token
        dc_ms = output_tokens * 5.0   # assumed ~5 ms per generated token
return {"prefill": pf, "pf_ms": round(pf_ms, 1),
"decode": dc, "dc_ms": round(dc_ms, 1),
"total_ms": round(pf_ms + dc_ms, 1)}
router = PDRouter()
result = router.serve(input_tokens=2000, output_tokens=200)
print(f"Prefill: {result['prefill']} ({result['pf_ms']}ms)")
print(f"Decode: {result['decode']} ({result['dc_ms']}ms)")
print(f"Total: {result['total_ms']}ms")
# Expected output:
# Prefill: gpu-0 (20.0ms)
# Decode: gpu-2 (1000.0ms)
# Total: 1020.0ms
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| KV-cache OOM | Server rejects new requests, GPU memory exhausted | Too many concurrent long-context requests | Set max_model_len, implement request queuing, monitor cache utilization |
| Tail latency spike | P99 latency 10× worse than P50 | One long request blocks batch, straggler GPU | Length-aware scheduling, separate long/short request queues |
| GPU fragmentation | Low utilization despite high demand | Requests don't fill GPU batches efficiently | Dynamic batching (vLLM continuous batching), right-size GPU allocation |
| Cascade failure | All replicas go down simultaneously | Shared dependency failure, no circuit breakers | Health checks, circuit breakers, graceful degradation |
| NCCL timeout | Training/serving hangs silently for minutes | Network partition between GPUs, slow interconnect | NCCL timeout tuning, heartbeat monitoring, automatic restart |
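The cascade-failure row typically ends with a circuit breaker in front of each worker pool. A minimal sketch, assuming a failure-count threshold and a fixed cooldown; production breakers usually track error rates and integrate with the health checks above.

# Minimal circuit-breaker sketch for a downstream inference worker pool.
# The threshold and cooldown are illustrative; real breakers usually track error rates.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            self.opened_at = None          # half-open: let the next request probe
            self.failures = 0
            return True
        return False                       # open: fail fast instead of spreading overload

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

breaker = CircuitBreaker()
for ok in (False, False, False, True):
    if breaker.allow():
        breaker.record(ok)
print("worker accepting traffic:", breaker.allow())   # False right after the breaker opens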
◆ Hands-On Exercises¶
Exercise 1: Benchmark Serving Throughput¶
Goal: Compare single-GPU vs multi-GPU serving performance
Time: 45 minutes
Steps:
1. Deploy a 7B model on 1 GPU with vLLM and benchmark it with 100 concurrent requests.
2. Deploy the same model on 2 GPUs with tensor parallelism and repeat the benchmark.
3. Measure: throughput (tokens/sec), P50/P95/P99 latency, GPU utilization.
4. Analyze: when does adding GPUs help, and when does it hurt?
Expected Output: Throughput/latency comparison table and a scaling-efficiency analysis
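A minimal concurrency harness for step 1, assuming the vLLM OpenAI-compatible server from the Code & Implementation section is running locally; the prompt, request count, and model name are placeholders.

# Sketch of a concurrent latency benchmark against an OpenAI-compatible endpoint.
# Assumes the vLLM server shown earlier is running on localhost:8000.
import asyncio, statistics, time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")

async def one_request(model: str) -> float:
    start = time.perf_counter()
    await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Say hello in five words."}],
        max_tokens=32,
    )
    return time.perf_counter() - start

async def benchmark(model: str, concurrency: int = 100) -> None:
    latencies = sorted(await asyncio.gather(*[one_request(model) for _ in range(concurrency)]))
    p50 = statistics.median(latencies)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    p99 = latencies[int(0.99 * (len(latencies) - 1))]
    print(f"p50={p50:.2f}s p95={p95:.2f}s p99={p99:.2f}s")

# asyncio.run(benchmark("meta-llama/Llama-3.1-70B-Instruct"))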
Exercise 2: Design a P/D Disaggregated Architecture¶
Goal: Design a serving system for a chat product with 10K concurrent users
Time: 30 minutes
Steps:
1. Given: average 500-token input, 200-token output, target P99 < 2s.
2. Calculate: prefill compute budget vs decode memory budget.
3. Determine: the optimal GPU split (how many prefill vs decode workers?).
4. Draw: an architecture diagram showing request flow and KV-cache transfer.
Expected Output: Architecture diagram + capacity planning table
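A starting point for steps 2 and 3. Every throughput and traffic figure below is an assumption chosen to make the arithmetic concrete; substitute measured numbers before sizing anything real.

# Back-of-envelope capacity split for a P/D disaggregated deployment.
# Every number here is an assumption for illustration; replace with measured values.
import math

concurrent_users    = 10_000
input_tokens        = 500
output_tokens       = 200
prefill_tok_per_s   = 20_000   # assumed prefill throughput per GPU (compute-bound)
decode_tok_per_s    = 2_000    # assumed decode throughput per GPU (memory-bound)
requests_per_user_s = 1 / 30   # assumed: one request every 30 seconds per user

req_per_s    = concurrent_users * requests_per_user_s
prefill_gpus = math.ceil(req_per_s * input_tokens / prefill_tok_per_s)
decode_gpus  = math.ceil(req_per_s * output_tokens / decode_tok_per_s)

print(f"{req_per_s:.0f} req/s -> ~{prefill_gpus} prefill GPUs, ~{decode_gpus} decode GPUs")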
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 🔧 Hands-on | vLLM Documentation | Best open-source LLM serving engine — PagedAttention, continuous batching |
| 🔧 Hands-on | TGI Documentation | HuggingFace's production serving engine |
| 📄 Paper | Kwon et al. "PagedAttention" (vLLM, 2023) | The paper that revolutionized KV-cache management for LLM serving |
| 📘 Book | "AI Engineering" by Chip Huyen (2025), Ch 8 | Covers serving architecture, batching, and scaling patterns |
★ Sources¶
- vLLM documentation — https://docs.vllm.ai/
- TGI documentation — https://huggingface.co/docs/text-generation-inference/
- Inference Optimization
- Distributed Systems Fundamentals for AI