Model Serving for LLM Applications

Training creates a model. Serving turns it into a dependable API with latency, throughput, and failure behavior you can actually reason about.


★ TL;DR

  • What: The systems and patterns used to expose models as production endpoints.
  • Why: A strong model with weak serving still feels slow, flaky, and expensive.
  • Key point: Serving is a scheduling and systems problem, not just a "wrap it in FastAPI" problem.

★ Overview

Definition

Model serving is the runtime layer that accepts requests, prepares inputs, executes inference, and returns outputs under production constraints.

Scope

This note covers serving architectures, runtime choices, and operational trade-offs for LLM systems. For lower-level performance techniques, see Inference Optimization. For platform packaging, see Docker & Kubernetes for GenAI Deployment.

Significance

  • Serving determines the real user experience more than benchmark scores do.
  • Self-hosted GenAI teams spend major effort on batching, scheduling, and memory efficiency.
  • Serving knowledge is central to MLOps, LLMOps, and inference engineering roles.

Prerequisites


★ Deep Dive

Serving Request Path

Client request
-> auth and rate limiting
-> request validation
-> prompt / input shaping
-> routing to model or serving engine
-> batching and scheduling
-> inference runtime
-> output validation / formatting
-> metrics, traces, and logs
-> response
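
The sketch below walks this path end to end, assuming an OpenAI-compatible engine (for example vLLM) behind an illustrative UPSTREAM address; the in-process rate limit and print-based metrics are placeholders rather than production components:

# Sketch of the request path: validate -> rate limit -> route -> inference -> metrics.
# Assumes an OpenAI-compatible engine (e.g. vLLM) at UPSTREAM; limits and logging are placeholders.
import time
import httpx
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

UPSTREAM = "http://localhost:8000/v1"   # illustrative engine address
MAX_REQUESTS_PER_MINUTE = 60            # illustrative per-process limit

app = FastAPI()
_window_start, _window_count = time.monotonic(), 0

class ChatRequest(BaseModel):            # request validation
    model: str
    messages: list[dict]
    max_tokens: int = 256

@app.post("/chat")
async def chat(req: ChatRequest):
    global _window_start, _window_count
    # Naive fixed-window rate limiting (a real gateway would use a shared store)
    now = time.monotonic()
    if now - _window_start > 60:
        _window_start, _window_count = now, 0
    _window_count += 1
    if _window_count > MAX_REQUESTS_PER_MINUTE:
        raise HTTPException(status_code=429, detail="rate limit exceeded")

    # Route to the serving engine and time the inference call
    async with httpx.AsyncClient() as client:
        t0 = time.monotonic()
        resp = await client.post(f"{UPSTREAM}/chat/completions",
                                 json=req.model_dump(), timeout=60)
    print(f"upstream latency: {(time.monotonic() - t0) * 1000:.0f} ms")  # stand-in for metrics/traces
    if resp.status_code != 200:
        raise HTTPException(status_code=502, detail="upstream error")
    return resp.json()

Keeping these gateway concerns out of the model runtime is what lets each layer scale and fail independently.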

Core Serving Concerns

| Concern | Why It Matters |
| --- | --- |
| Latency | Interactive chat feels broken when time-to-first-token is poor |
| Throughput | Determines how many concurrent users the system can handle |
| Memory | LLM serving is often bottlenecked by model and KV-cache memory |
| Reliability | Timeouts, retries, and overload behavior must be explicit |
| Cost | Serving choices directly shape GPU usage and token economics |

Common Serving Patterns

| Pattern | Best For | Trade-Off |
| --- | --- | --- |
| Managed API | Fast iteration, low infra overhead | Less control, possible vendor lock-in |
| Self-hosted open model | Privacy, cost control, fine-tuned models | Needs GPU capacity and ops maturity |
| Hybrid routing | Mixed workloads and cost tuning | More complexity |
| Async batch serving | Offline generation and evaluation | Not ideal for interactive UX |
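
A hybrid router can start as a plain rule that keeps short, tool-free requests on a small self-hosted model and escalates the rest to a larger managed model; the model names and length threshold below are illustrative:

# Illustrative hybrid routing rule: cheap, simple prompts go to a small local model,
# everything else goes to a larger managed model. Names and threshold are placeholders.
SMALL_MODEL = "local/llama-3.1-8b-instruct"   # assumed self-hosted deployment
LARGE_MODEL = "managed/frontier-model"        # assumed managed API model

def route(prompt: str, needs_tools: bool = False) -> str:
    """Pick a serving target from crude request features."""
    if needs_tools or len(prompt) > 2000:
        return LARGE_MODEL
    return SMALL_MODEL

print(route("Summarize this ticket in one sentence."))   # -> small model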

Serving Engines

| Engine | Best Known For | Typical Fit |
| --- | --- | --- |
| vLLM | High-throughput open-source LLM serving | General self-hosted LLM serving |
| TGI | Hugging Face ecosystem integration | Teams already in the HF stack |
| Triton Inference Server | Multi-model, multi-backend serving | Broader ML platform setups |
| SGLang | Efficient serving for structured generation workloads | High-throughput advanced setups |
| Ollama | Local developer ergonomics | Local testing, not a primary production stack |

API Shape Decisions

Choose early whether you need:

  • synchronous response vs streaming
  • chat format vs raw completion format
  • tool-call capable outputs
  • strict JSON schema outputs
  • tenant-aware rate limits

These choices affect the gateway, eval harness, and downstream clients.
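
Streaming in particular changes the client contract; a minimal sketch with the OpenAI SDK against an assumed local OpenAI-compatible endpoint:

# Streaming response shape with the OpenAI SDK (tokens arrive as they are generated).
# base_url and model name are illustrative; assumes a local vLLM-style server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain KV-cache in one sentence."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)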

Batch vs Interactive Serving

| Mode | Optimize For | Typical Examples |
| --- | --- | --- |
| Interactive | Low latency and fast first token | Chat, copilots, agents |
| Batch | Throughput and unit cost | Classification, offline summaries, eval runs |
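
On the batch side, if vLLM is the engine, its offline Python API skips the HTTP layer and batches a fixed prompt set internally; a sketch assuming a CUDA GPU and an illustrative model name:

# Offline batch generation with the vLLM Python API (throughput-oriented, no HTTP server).
# Requires a CUDA GPU and `pip install vllm`; model name is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(max_tokens=64, temperature=0.0)

prompts = [f"Summarize ticket {i} in one sentence." for i in range(100)]
outputs = llm.generate(prompts, params)   # vLLM batches and schedules internally

for out in outputs[:3]:
    print(out.outputs[0].text.strip())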

Practical Metrics

| Metric | Why It Matters |
| --- | --- |
| TTFT | User perceives responsiveness through first token speed |
| Tokens/sec | Measures generation throughput |
| Requests/sec | Endpoint capacity indicator |
| GPU utilization | Tells whether hardware is being used efficiently |
| P95 latency | Better than averages for production reliability |
| Error rate | Helps separate overload from semantic failures |
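
TTFT can be measured directly by timing the first chunk of a streaming response; a minimal sketch against an assumed local OpenAI-compatible endpoint:

# Measure TTFT (time to first token) and chunk rate from a streaming response.
# Assumes an OpenAI-compatible server at localhost:8000; model name is illustrative.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

t0 = time.monotonic()
first_token_at = None
chunks = 0
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain paged attention briefly."}],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.monotonic()   # first generated content observed
        chunks += 1
total = time.monotonic() - t0
ttft_ms = (first_token_at - t0) * 1000 if first_token_at else float("nan")
print(f"TTFT: {ttft_ms:.0f} ms, ~{chunks / total:.1f} chunks/s over {total:.1f}s")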

Minimal Self-Hosted OpenAI-Compatible Serving

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000

Design Heuristics

  1. Start with the simplest serving mode that meets the product need.
  2. Separate gateway concerns from model runtime concerns.
  3. Stream responses for chat-like experiences when possible.
  4. Add caching, batching, and routing only after measuring real bottlenecks.
  5. Treat overload behavior as part of product design.

◆ Quick Reference

| Problem | First Serving Move |
| --- | --- |
| High API cost | Evaluate self-hosting or smaller-model routing |
| Slow first token | Reduce prompt size, enable streaming, inspect prefill path |
| GPU memory pressure | Quantize, reduce batch size, inspect KV-cache growth |
| Uneven traffic | Add queueing, autoscaling, and backpressure |
| Mixed workloads | Split interactive and batch paths |

○ Gotchas & Common Mistakes

  • Teams often blame the model when the real bottleneck is gateway design or retrieval latency.
  • A single serving stack for every workload usually performs badly.
  • OpenAI-compatible APIs simplify clients but do not remove serving complexity.
  • P50 latency can look healthy while P95 latency is unacceptable.

○ Interview Angles

  • Q: What is the difference between inference optimization and model serving?
  • A: Inference optimization focuses on making the core generation path more efficient, for example quantization or KV-cache improvements. Model serving covers the full production runtime around that path, including APIs, routing, scheduling, scaling, and failure handling.

  • Q: When would you self-host instead of using a managed API?

  • A: When privacy, volume economics, model customization, or latency control outweigh the extra operational burden. Otherwise, managed APIs are usually the faster path.

★ Code & Implementation

vLLM Server Setup (OpenAI-Compatible)

# pip install vllm>=0.8
# ⚠️ Last tested: 2026-04 | Requires: CUDA GPU, vllm>=0.8

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --port 8000

# Query the vLLM server with the standard OpenAI client (OpenAI-compatible API)
# pip install openai>=1.60
# ⚠️ Last tested: 2026-04
from openai import OpenAI
import time

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Throughput test: 30 requests
prompts = ["What is RAG?", "Explain LoRA.", "What is MoE?"] * 10
start = time.monotonic()
for p in prompts:
    client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": p}],
        max_tokens=50,
    )
elapsed = time.monotonic() - start
print(f"{len(prompts)} requests in {elapsed:.1f}s = {len(prompts)/elapsed:.1f} req/s")

★ Connections

| Relationship | Topics |
| --- | --- |
| Builds on | Inference Optimization, Docker & Kubernetes for GenAI Deployment, AI System Design for GenAI Applications |
| Leads to | Monitoring & Observability for GenAI Systems, Cost Optimization for GenAI Systems |
| Compare with | Managed API consumption, classical REST service deployment |
| Cross-domain | Distributed systems, queueing, API platform design |

Health Check and Load Test Endpoint

# pip install fastapi>=0.110 uvicorn>=0.29 httpx>=0.27
# ⚠️ Last tested: 2026-04 | Requires: fastapi>=0.110, httpx>=0.27
import asyncio, time, statistics
from fastapi import FastAPI
from pydantic import BaseModel
import httpx

app = FastAPI()

class HealthStatus(BaseModel):
    status: str
    model_loaded: bool
    avg_latency_ms: float
    requests_served: int

_request_count = 0
_latencies: list[float] = []
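
# Record per-request latency so /health has real data to report
# (simple in-process tracking; a real deployment would export proper metrics instead).
@app.middleware("http")
async def _track_requests(request, call_next):
    global _request_count
    t0 = time.monotonic()
    response = await call_next(request)
    _latencies.append((time.monotonic() - t0) * 1000)
    _request_count += 1
    return response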

@app.get("/health", response_model=HealthStatus)
async def health_check():
    """Production health check for LLM serving endpoint."""
    return HealthStatus(
        status="healthy" if _latencies and statistics.mean(_latencies) < 5000 else "degraded",
        model_loaded=True,
        avg_latency_ms=round(statistics.mean(_latencies), 1) if _latencies else 0,
        requests_served=_request_count,
    )

async def load_test(base_url: str, num_requests: int = 20, concurrency: int = 5) -> dict:
    """Simple load test for model serving endpoint."""
    async def single_request(client: httpx.AsyncClient, i: int) -> float:
        payload = {"model": "default", "messages": [{"role": "user", "content": f"Test {i}"}], "max_tokens": 50}
        t0 = time.monotonic()
        try:
            resp = await client.post(f"{base_url}/v1/chat/completions", json=payload, timeout=30)
        except httpx.HTTPError:
            return -1  # count timeouts and connection errors as failures
        elapsed = (time.monotonic() - t0) * 1000
        return elapsed if resp.status_code == 200 else -1

    t_start = time.monotonic()
    async with httpx.AsyncClient() as client:
        sem = asyncio.Semaphore(concurrency)
        async def bounded(i):
            async with sem:
                return await single_request(client, i)
        results = await asyncio.gather(*[bounded(i) for i in range(num_requests)])
    wall_s = time.monotonic() - t_start

    ok = sorted(r for r in results if r > 0)
    errors = sum(1 for r in results if r < 0)
    n = len(ok)
    return {
        "total": num_requests, "errors": errors,
        "p50_ms": round(ok[n // 2], 1) if n else 0,
        "p95_ms": round(ok[int(n * 0.95)], 1) if n else 0,
        "p99_ms": round(ok[int(n * 0.99)], 1) if n else 0,
        # Throughput from wall-clock time rather than summed latencies
        "throughput_rps": round(num_requests / wall_s, 1) if wall_s else 0,
    }

# result = asyncio.run(load_test("http://localhost:8000", num_requests=50, concurrency=10))
# print(f"P50: {result['p50_ms']}ms | P95: {result['p95_ms']}ms | Errors: {result['errors']}")
# Example output (varies with model and hardware): P50: ~120ms | P95: ~450ms | Errors: 0

◆ Production Failure Modes

| Failure | Symptoms | Root Cause | Mitigation |
| --- | --- | --- | --- |
| Cold start latency | First request takes 30-60 seconds | Model loading from disk to GPU on demand | Model preloading, warm standby replicas |
| GPU memory fragmentation | OOM despite sufficient total VRAM | Non-contiguous allocation from dynamic batching | vLLM paged attention, periodic defragmentation |
| Batch queue starvation | High-priority requests delayed by large batches | FIFO batching without priority | Priority queues, preemptive scheduling |
| Model version rollback failure | Can't revert after a bad deployment | No versioned model registry | Model registry (MLflow), blue-green deployments |
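
Cold starts are the easiest of these to probe: send a tiny warm-up request at startup and compare it with a second call once the model is resident. A sketch, with the endpoint and model name as assumptions:

# Warm-up probe: pay the model-load / first-request cost before real traffic arrives.
# Endpoint and model name are illustrative; assumes an OpenAI-compatible server.
import time
from openai import OpenAI

def warm_up(base_url: str = "http://localhost:8000/v1",
            model: str = "meta-llama/Llama-3.1-8B-Instruct") -> float:
    client = OpenAI(base_url=base_url, api_key="not-needed")
    t0 = time.monotonic()
    client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=1,
    )
    return time.monotonic() - t0

print(f"warm-up took {warm_up():.1f}s")  # compare with a second call to see warm latency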

◆ Hands-On Exercises

Exercise 1: Deploy and Benchmark a Model Server

Goal: Deploy a model with vLLM and benchmark it at different concurrency levels
Time: 45 minutes
Steps:
  1. Deploy a small model with vLLM or TGI
  2. Benchmark latency at 1, 10, and 50 concurrent requests
  3. Implement a health check endpoint that verifies GPU readiness
  4. Measure cold start time vs warm request time
Expected Output: Latency vs concurrency table, health check implementation


◆ Resources

| Type | Resource | Why |
| --- | --- | --- |
| 🔧 Hands-on | vLLM Documentation | Widely used open-source LLM serving engine |
| 🔧 Hands-on | TGI Documentation | Hugging Face's production serving solution |
| 📄 Paper | Kwon et al., "PagedAttention" (2023) | KV-cache management that powers vLLM |
| 📘 Book | "AI Engineering" by Chip Huyen (2025), Ch 8 | Model serving patterns for production |

★ Sources

  • vLLM documentation - https://docs.vllm.ai
  • Hugging Face Text Generation Inference documentation
  • NVIDIA Triton Inference Server documentation
  • Inference Optimization