Tags: inference, llmops, production, serving, tgi, triton, vllm
Model Serving for LLM Applications
Training creates a model. Serving turns it into a dependable API with latency, throughput, and failure behavior you can actually reason about.
★ TL;DR
- What: The systems and patterns used to expose models as production endpoints.
- Why: A strong model with weak serving still feels slow, flaky, and expensive.
- Key point: Serving is a scheduling and systems problem, not just a "wrap it in FastAPI" problem.
★ Overview
Definition
Model serving is the runtime layer that accepts requests, prepares inputs, executes inference, and returns outputs under production constraints.
Scope
This note covers serving architectures, runtime choices, and operational trade-offs for LLM systems. For lower-level performance techniques, see Inference Optimization. For platform packaging, see Docker & Kubernetes for GenAI Deployment.
Significance
- Serving determines the real user experience more than benchmark scores do.
- Self-hosted GenAI teams spend major effort on batching, scheduling, and memory efficiency.
- Serving knowledge is central to MLOps, LLMOps, and inference engineering roles.
Prerequisites
★ Deep Dive
Serving Request Path
Client request
-> auth and rate limiting
-> request validation
-> prompt / input shaping
-> routing to model or serving engine
-> batching and scheduling
-> inference runtime
-> output validation / formatting
-> metrics, traces, and logs
-> response
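The same path can be prototyped as a thin gateway in front of an OpenAI-compatible engine. The sketch below is illustrative only: the engine URL, API-key header, model name, and validation limits are assumptions, not part of any particular stack.

```python
# Minimal gateway sketch for the request path above (illustrative assumptions:
# engine URL, API-key header, model name, and validation limits are all made up).
import time

import httpx
from fastapi import FastAPI, HTTPException, Request
from pydantic import BaseModel, Field

ENGINE_URL = "http://localhost:8000/v1"  # assumed OpenAI-compatible engine (e.g. vLLM)
app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str = Field(min_length=1, max_length=8000)   # request validation
    max_tokens: int = Field(default=256, le=1024)

@app.post("/generate")
async def generate(req: GenerateRequest, request: Request):
    # auth / rate limiting: a real gateway would check tenant quotas here
    if not request.headers.get("x-api-key"):
        raise HTTPException(status_code=401, detail="missing API key")
    # prompt shaping + routing to the serving engine
    payload = {
        "model": "default",
        "messages": [{"role": "user", "content": req.prompt}],
        "max_tokens": req.max_tokens,
    }
    t0 = time.monotonic()
    async with httpx.AsyncClient() as client:
        resp = await client.post(f"{ENGINE_URL}/chat/completions", json=payload, timeout=60)
    latency_ms = (time.monotonic() - t0) * 1000  # metrics / traces hook
    if resp.status_code != 200:
        raise HTTPException(status_code=502, detail="engine error")
    return {"output": resp.json()["choices"][0]["message"]["content"],
            "latency_ms": round(latency_ms, 1)}
```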
Core Serving Concerns
| Concern | Why It Matters |
| --- | --- |
| Latency | Interactive chat feels broken when time-to-first-token is poor |
| Throughput | Determines how many concurrent users the system can handle |
| Memory | LLM serving is often bottlenecked by model and KV-cache memory |
| Reliability | Timeouts, retries, and overload behavior must be explicit |
| Cost | Serving choices directly shape GPU usage and token economics |
Common Serving Patterns
| Pattern | Best For | Trade-Off |
| --- | --- | --- |
| Managed API | Fast iteration, low infra overhead | Less control, possible vendor lock-in |
| Self-hosted open model | Privacy, cost control, fine-tuned models | Requires GPU and ops maturity |
| Hybrid routing | Mixed workloads and cost tuning | More complexity (see the routing sketch below) |
| Async batch serving | Offline generation and evaluation | Not ideal for interactive UX |
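One way to make hybrid routing concrete is a thin dispatch function that picks a backend per request. This is a sketch under stated assumptions: the model names and the length threshold are arbitrary examples, and the routing policy itself is a product decision.

```python
# Hypothetical hybrid router: a cheap self-hosted model for routine prompts,
# a managed API for long or high-stakes requests. Names and thresholds are examples.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # self-hosted engine
managed = OpenAI()  # reads OPENAI_API_KEY from the environment

def route(prompt: str, high_stakes: bool = False) -> tuple[OpenAI, str]:
    # Example policy: key on prompt length and an explicit "high stakes" flag.
    if high_stakes or len(prompt) > 2000:
        return managed, "gpt-4o-mini"  # assumed managed model name
    return local, "meta-llama/Llama-3.1-8B-Instruct"  # assumed self-hosted model

def generate(prompt: str, high_stakes: bool = False) -> str:
    client, model = route(prompt, high_stakes)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return resp.choices[0].message.content
```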
Serving Engines
| Engine | Best Known For | Typical Fit |
| --- | --- | --- |
| vLLM | High-throughput open-source LLM serving | General self-hosted LLM serving |
| TGI | Hugging Face ecosystem integration | Teams already in the HF stack |
| Triton Inference Server | Multi-model, multi-backend serving | Broader ML platform setups |
| SGLang | Efficient serving for structured generation workloads | High-throughput advanced setups |
| Ollama | Local developer ergonomics | Local testing, not a primary production stack |
API Shape Decisions
Choose early whether you need:

- synchronous responses vs streaming
- chat format vs raw completion format
- tool-call capable outputs
- strict JSON schema outputs
- tenant-aware rate limits

These choices affect the gateway, eval harness, and downstream clients; the sketch below shows how two of them surface at the API level.
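A sketch of streaming vs strict-JSON requests against an OpenAI-compatible endpoint. The host, model name, and `response_format` usage are assumptions; whether structured output is honored depends on the serving engine and version.

```python
# Sketch: the same OpenAI-compatible endpoint consumed in streaming vs strict-JSON mode.
# Host and model name are assumptions; response_format support varies by engine.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
MODEL = "meta-llama/Llama-3.1-8B-Instruct"

# Streaming: tokens arrive incrementally, which is what makes chat UX feel responsive.
stream = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Explain the KV cache in two sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
print()

# Strict JSON: ask the server to constrain output to a JSON object (engine-dependent).
resp = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": 'Return {"sentiment": ...} for: "great product"'}],
    response_format={"type": "json_object"},
)
print(resp.choices[0].message.content)
```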
Batch vs Interactive Serving
| Mode | Optimize For | Typical Examples |
| --- | --- | --- |
| Interactive | Low latency and fast first token | Chat, copilots, agents |
| Batch | Throughput and unit cost | Classification, offline summaries, eval runs |
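Batch workloads often skip the HTTP server entirely and use an offline engine API so the scheduler can pack requests freely. A sketch using vLLM's offline `LLM` API; the model name and sampling settings are illustrative.

```python
# Offline batch generation with vLLM's Python API instead of the HTTP server.
# Sketch only: model name and sampling parameters are illustrative choices.
from vllm import LLM, SamplingParams

prompts = [f"Summarize document {i} in one sentence." for i in range(100)]
sampling = SamplingParams(temperature=0.0, max_tokens=64)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", gpu_memory_utilization=0.90)
outputs = llm.generate(prompts, sampling)  # vLLM batches and schedules internally

for out in outputs[:3]:
    print(out.outputs[0].text.strip())
```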
Practical Metrics
| Metric | Why It Matters |
| --- | --- |
| TTFT (time to first token) | Users perceive responsiveness through first-token speed |
| Tokens/sec | Measures generation throughput |
| Requests/sec | Endpoint capacity indicator |
| GPU utilization | Shows whether hardware is being used efficiently |
| P95 latency | Better than averages for judging production reliability |
| Error rate | Helps separate overload from semantic failures |
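TTFT and tokens/sec can be estimated from the client side with a single streaming request. A rough sketch: it counts stream chunks as a proxy for tokens, which is an approximation, and the host and model name are assumptions.

```python
# Client-side TTFT and tokens/sec estimate via streaming.
# Approximation: each content chunk is counted as one token; host/model are assumptions.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.monotonic()
first_token_at = None
tokens = 0
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain continuous batching."}],
    max_tokens=200,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.monotonic()  # time to first token
        tokens += 1
total = time.monotonic() - start

if first_token_at:
    ttft = first_token_at - start
    decode_time = max(total - ttft, 1e-6)
    print(f"TTFT: {ttft * 1000:.0f} ms | ~{tokens / decode_time:.1f} tokens/s after first token")
```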
Minimal Self-Hosted OpenAI-Compatible Serving
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000
```
Design Heuristics
- Start with the simplest serving mode that meets the product need.
- Separate gateway concerns from model runtime concerns.
- Stream responses for chat-like experiences when possible.
- Add caching, batching, and routing only after measuring real bottlenecks.
- Treat overload behavior as part of product design.
◆ Quick Reference
| Problem | First Serving Move |
| --- | --- |
| High API cost | Evaluate self-hosting or smaller-model routing |
| Slow first token | Reduce prompt size, enable streaming, inspect the prefill path |
| GPU memory pressure | Quantize, reduce batch size, inspect KV-cache growth |
| Uneven traffic | Add queueing, autoscaling, and backpressure (see the sketch below) |
| Mixed workloads | Split interactive and batch paths |
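Backpressure can start very small: a bounded in-flight limit that sheds excess load instead of queueing forever. A minimal sketch; the endpoint shape and the concurrency limit of 8 are arbitrary examples to be tuned against measured capacity.

```python
# Minimal backpressure sketch: cap in-flight requests and shed the rest with 503.
# The concurrency limit and endpoint shape are arbitrary examples.
import asyncio

from fastapi import FastAPI, HTTPException

app = FastAPI()
MAX_IN_FLIGHT = 8
_slots = asyncio.Semaphore(MAX_IN_FLIGHT)

@app.post("/generate")
async def generate(payload: dict):
    if _slots.locked():  # every slot taken: reject now rather than queue unboundedly
        raise HTTPException(status_code=503, detail="server at capacity, retry later")
    async with _slots:
        # ... forward the request to the serving engine here ...
        await asyncio.sleep(0.1)  # placeholder for the actual inference call
        return {"status": "ok"}
```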
○ Gotchas & Common Mistakes
- Teams often blame the model when the real bottleneck is gateway design or retrieval latency.
- A single serving stack for every workload usually performs badly.
- OpenAI-compatible APIs simplify clients but do not remove serving complexity.
- P50 latency can look healthy while P95 latency is unacceptable.
○ Interview Angles
Q: What is the difference between inference optimization and model serving?
A: Inference optimization focuses on making the core generation path more efficient, for example quantization or KV-cache improvements. Model serving covers the full production runtime around that path, including APIs, routing, scheduling, scaling, and failure handling.

Q: When would you self-host instead of using a managed API?
A: When privacy, volume economics, model customization, or latency control outweigh the extra operational burden. Otherwise, managed APIs are usually the faster path.
★ Code & Implementation
vLLM Server Setup (OpenAI-Compatible)
```bash
# pip install vllm>=0.8
# ⚠️ Last tested: 2026-04 | Requires: CUDA GPU, vllm>=0.8
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --port 8000
```
```python
# Query the vLLM server - identical to the OpenAI API
# pip install openai>=1.60
# ⚠️ Last tested: 2026-04
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Throughput test: 30 sequential requests
prompts = ["What is RAG?", "Explain LoRA.", "What is MoE?"] * 10

start = time.monotonic()
for p in prompts:
    client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": p}],
        max_tokens=50,
    )
elapsed = time.monotonic() - start
print(f"{len(prompts)} requests in {elapsed:.1f}s = {len(prompts) / elapsed:.1f} req/s")
```
Health Check and Load Test Endpoint
```python
# pip install fastapi>=0.110 uvicorn>=0.29 httpx>=0.27
# ⚠️ Last tested: 2026-04 | Requires: fastapi>=0.110, httpx>=0.27
import asyncio
import statistics
import time

import httpx
from fastapi import FastAPI, Request
from pydantic import BaseModel

app = FastAPI()

class HealthStatus(BaseModel):
    status: str
    model_loaded: bool
    avg_latency_ms: float
    requests_served: int

_request_count = 0
_latencies: list[float] = []

@app.middleware("http")
async def record_latency(request: Request, call_next):
    """Populate the counters that the health check reads."""
    global _request_count
    t0 = time.monotonic()
    response = await call_next(request)
    _latencies.append((time.monotonic() - t0) * 1000)
    _request_count += 1
    return response

@app.get("/health", response_model=HealthStatus)
async def health_check():
    """Production health check for an LLM serving endpoint."""
    return HealthStatus(
        status="healthy" if _latencies and statistics.mean(_latencies) < 5000 else "degraded",
        model_loaded=True,
        avg_latency_ms=round(statistics.mean(_latencies), 1) if _latencies else 0,
        requests_served=_request_count,
    )

async def load_test(base_url: str, num_requests: int = 20, concurrency: int = 5) -> dict:
    """Simple load test for a model serving endpoint."""

    async def single_request(client: httpx.AsyncClient, i: int) -> float:
        payload = {
            "model": "default",
            "messages": [{"role": "user", "content": f"Test {i}"}],
            "max_tokens": 50,
        }
        t0 = time.monotonic()
        try:
            resp = await client.post(f"{base_url}/v1/chat/completions", json=payload, timeout=30)
        except httpx.HTTPError:
            return -1  # count timeouts and connection errors as failures
        elapsed = (time.monotonic() - t0) * 1000
        return elapsed if resp.status_code == 200 else -1

    async with httpx.AsyncClient() as client:
        sem = asyncio.Semaphore(concurrency)

        async def bounded(i):
            async with sem:
                return await single_request(client, i)

        results = await asyncio.gather(*[bounded(i) for i in range(num_requests)])

    ok = sorted(r for r in results if r > 0)
    errors = sum(1 for r in results if r < 0)
    n = len(ok)
    return {
        "total": num_requests,
        "errors": errors,
        "p50_ms": round(ok[n // 2], 1) if n else 0,
        "p95_ms": round(ok[int(n * 0.95)], 1) if n else 0,
        "p99_ms": round(ok[int(n * 0.99)], 1) if n else 0,
        # Rough throughput estimate: total request time divided by concurrency
        "throughput_rps": round(num_requests / (sum(ok) / 1000 / concurrency), 1) if ok else 0,
    }

# result = asyncio.run(load_test("http://localhost:8000", num_requests=50, concurrency=10))
# print(f"P50: {result['p50_ms']}ms | P95: {result['p95_ms']}ms | Errors: {result['errors']}")
# Expected output: P50: ~120ms | P95: ~450ms | Errors: 0
```
◆ Production Failure Modes
| Failure | Symptoms | Root Cause | Mitigation |
| --- | --- | --- | --- |
| Cold start latency | First request takes 30-60 seconds | Model loading from disk to GPU on demand | Model preloading, warm standby replicas |
| GPU memory fragmentation | OOM despite sufficient total VRAM | Non-contiguous allocation from dynamic batching | vLLM paged attention, periodic defragmentation |
| Batch queue starvation | High-priority requests delayed by large batches | FIFO batching without priorities | Priority queues, preemptive scheduling |
| Model version rollback failure | Can't revert after a bad deployment | No versioned model registry | Model registry (MLflow), blue-green deployments |
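Cold starts are often mitigated by firing a warmup request when a replica boots, before it is marked ready for traffic. A sketch; the endpoint, model name, and timeout are assumptions.

```python
# Warmup sketch: run one tiny generation at startup so the first real user never
# pays the model-load / first-kernel cost. Endpoint, model, and timeout are assumptions.
import httpx

def warm_up(base_url: str = "http://localhost:8000",
            model: str = "meta-llama/Llama-3.1-8B-Instruct") -> bool:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1,
    }
    try:
        resp = httpx.post(f"{base_url}/v1/chat/completions", json=payload, timeout=120)
        return resp.status_code == 200  # only mark the replica ready on success
    except httpx.HTTPError:
        return False

# Call warm_up() from the container startup hook or readiness probe before serving traffic.
```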
◆ Hands-On Exercises
Exercise 1: Deploy and Benchmark a Model Server
Goal: Deploy a model with vLLM and benchmark it at different concurrency levels
Time: 45 minutes
Steps:
1. Deploy a small model with vLLM or TGI
2. Benchmark latency at 1, 10, and 50 concurrent requests
3. Implement a health check endpoint that verifies GPU readiness
4. Measure cold start time vs warm request time
Expected Output: Latency vs concurrency table, health check implementation
★ Recommended Resources
★ Sources
- vLLM documentation: https://docs.vllm.ai
- Hugging Face Text Generation Inference documentation
- NVIDIA Triton Inference Server documentation
★ Connections

- Inference Optimization