Tags: inference, llmops, production, serving, tgi, triton, vllm
Model Serving for LLM Applications
Training creates a model. Serving turns it into a dependable API with latency, throughput, and failure behavior you can actually reason about.
★ TL;DR
- What: The systems and patterns used to expose models as production endpoints.
- Why: A strong model with weak serving still feels slow, flaky, and expensive.
- Key point: Serving is a scheduling and systems problem, not just a "wrap it in FastAPI" problem.
★ Overview
Definition
Model serving is the runtime layer that accepts requests, prepares inputs, executes inference, and returns outputs under production constraints.
Scope
This note covers serving architectures, runtime choices, and operational trade-offs for LLM systems. For lower-level performance techniques, see Inference Optimization. For platform packaging, see Docker & Kubernetes for GenAI Deployment.
Significance
- Serving determines the real user experience more than benchmark scores do.
- Self-hosted GenAI teams spend major effort on batching, scheduling, and memory efficiency.
- Serving knowledge is central to MLOps, LLMOps, and inference engineering roles.
Prerequisites
★ Deep Dive
Serving Request Path
Client request
-> auth and rate limiting
-> request validation
-> prompt / input shaping
-> routing to model or serving engine
-> batching and scheduling
-> inference runtime
-> output validation / formatting
-> metrics, traces, and logs
-> response
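The same path can be prototyped as a thin gateway in front of an OpenAI-compatible engine. The sketch below is illustrative only: the engine URL, API-key header, model name, and validation limits are assumptions, not part of any particular stack.

```python
# Minimal gateway sketch for the request path above (illustrative assumptions:
# engine URL, API-key header, model name, and validation limits are all made up).
import time

import httpx
from fastapi import FastAPI, HTTPException, Request
from pydantic import BaseModel, Field

ENGINE_URL = "http://localhost:8000/v1"  # assumed OpenAI-compatible engine (e.g. vLLM)
app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str = Field(min_length=1, max_length=8000)   # request validation
    max_tokens: int = Field(default=256, le=1024)

@app.post("/generate")
async def generate(req: GenerateRequest, request: Request):
    # auth / rate limiting: a real gateway would check tenant quotas here
    if not request.headers.get("x-api-key"):
        raise HTTPException(status_code=401, detail="missing API key")
    # prompt shaping + routing to the serving engine
    payload = {
        "model": "default",
        "messages": [{"role": "user", "content": req.prompt}],
        "max_tokens": req.max_tokens,
    }
    t0 = time.monotonic()
    async with httpx.AsyncClient() as client:
        resp = await client.post(f"{ENGINE_URL}/chat/completions", json=payload, timeout=60)
    latency_ms = (time.monotonic() - t0) * 1000  # metrics / traces hook
    if resp.status_code != 200:
        raise HTTPException(status_code=502, detail="engine error")
    return {"output": resp.json()["choices"][0]["message"]["content"],
            "latency_ms": round(latency_ms, 1)}
```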
Core Serving Concerns
| Concern | Why It Matters |
| --- | --- |
| Latency | Interactive chat feels broken when time-to-first-token is poor |
| Throughput | Determines how many concurrent users the system can handle |
| Memory | LLM serving is often bottlenecked by model and KV-cache memory |
| Reliability | Timeouts, retries, and overload behavior must be explicit |
| Cost | Serving choices directly shape GPU usage and token economics |
Common Serving Patterns
| Pattern | Best For | Trade-Off |
| --- | --- | --- |
| Managed API | Fast iteration, low infra overhead | Less control, possible vendor lock-in |
| Self-hosted open model | Privacy, cost control, fine-tuned models | Requires GPU and ops maturity |
| Hybrid routing | Mixed workloads and cost tuning | More complexity (see the routing sketch below) |
| Async batch serving | Offline generation and evaluation | Not ideal for interactive UX |
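One way to make hybrid routing concrete is a thin dispatch function that picks a backend per request. This is a sketch under stated assumptions: the model names and the length threshold are arbitrary examples, and the routing policy itself is a product decision.

```python
# Hypothetical hybrid router: a cheap self-hosted model for routine prompts,
# a managed API for long or high-stakes requests. Names and thresholds are examples.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # self-hosted engine
managed = OpenAI()  # reads OPENAI_API_KEY from the environment

def route(prompt: str, high_stakes: bool = False) -> tuple[OpenAI, str]:
    # Example policy: key on prompt length and an explicit "high stakes" flag.
    if high_stakes or len(prompt) > 2000:
        return managed, "gpt-4o-mini"  # assumed managed model name
    return local, "meta-llama/Llama-3.1-8B-Instruct"  # assumed self-hosted model

def generate(prompt: str, high_stakes: bool = False) -> str:
    client, model = route(prompt, high_stakes)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return resp.choices[0].message.content
```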
Serving Engines
| Engine | Best Known For | Typical Fit |
| --- | --- | --- |
| vLLM | High-throughput open-source LLM serving | General self-hosted LLM serving |
| TGI | Hugging Face ecosystem integration | Teams already in the HF stack |
| Triton Inference Server | Multi-model, multi-backend serving | Broader ML platform setups |
| SGLang | Efficient serving for structured generation workloads | High-throughput advanced setups |
| Ollama | Local developer ergonomics | Local testing, not a primary production stack |
API Shape Decisions
Choose early whether you need:

- synchronous responses vs streaming
- chat format vs raw completion format
- tool-call capable outputs
- strict JSON schema outputs
- tenant-aware rate limits

These choices affect the gateway, eval harness, and downstream clients; the sketch below shows how two of them surface at the API level.
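A sketch of streaming vs strict-JSON requests against an OpenAI-compatible endpoint. The host, model name, and `response_format` usage are assumptions; whether structured output is honored depends on the serving engine and version.

```python
# Sketch: the same OpenAI-compatible endpoint consumed in streaming vs strict-JSON mode.
# Host and model name are assumptions; response_format support varies by engine.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
MODEL = "meta-llama/Llama-3.1-8B-Instruct"

# Streaming: tokens arrive incrementally, which is what makes chat UX feel responsive.
stream = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Explain the KV cache in two sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
print()

# Strict JSON: ask the server to constrain output to a JSON object (engine-dependent).
resp = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": 'Return {"sentiment": ...} for: "great product"'}],
    response_format={"type": "json_object"},
)
print(resp.choices[0].message.content)
```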
Batch vs Interactive Serving
| Mode | Optimize For | Typical Examples |
| --- | --- | --- |
| Interactive | Low latency and fast first token | Chat, copilots, agents |
| Batch | Throughput and unit cost | Classification, offline summaries, eval runs |
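Batch workloads often skip the HTTP server entirely and use an offline engine API so the scheduler can pack requests freely. A sketch using vLLM's offline `LLM` API; the model name and sampling settings are illustrative.

```python
# Offline batch generation with vLLM's Python API instead of the HTTP server.
# Sketch only: model name and sampling parameters are illustrative choices.
from vllm import LLM, SamplingParams

prompts = [f"Summarize document {i} in one sentence." for i in range(100)]
sampling = SamplingParams(temperature=0.0, max_tokens=64)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", gpu_memory_utilization=0.90)
outputs = llm.generate(prompts, sampling)  # vLLM batches and schedules internally

for out in outputs[:3]:
    print(out.outputs[0].text.strip())
```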
Practical Metrics
| Metric | Why It Matters |
| --- | --- |
| TTFT (time to first token) | Users perceive responsiveness through first-token speed |
| Tokens/sec | Measures generation throughput |
| Requests/sec | Endpoint capacity indicator |
| GPU utilization | Shows whether hardware is being used efficiently |
| P95 latency | Better than averages for judging production reliability |
| Error rate | Helps separate overload from semantic failures |
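TTFT and tokens/sec can be estimated from the client side with a single streaming request. A rough sketch: it counts stream chunks as a proxy for tokens, which is an approximation, and the host and model name are assumptions.

```python
# Client-side TTFT and tokens/sec estimate via streaming.
# Approximation: each content chunk is counted as one token; host/model are assumptions.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.monotonic()
first_token_at = None
tokens = 0
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain continuous batching."}],
    max_tokens=200,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.monotonic()  # time to first token
        tokens += 1
total = time.monotonic() - start

if first_token_at:
    ttft = first_token_at - start
    decode_time = max(total - ttft, 1e-6)
    print(f"TTFT: {ttft * 1000:.0f} ms | ~{tokens / decode_time:.1f} tokens/s after first token")
```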
Minimal Self-Hosted OpenAI-Compatible Serving
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000
```
Design Heuristics
- Start with the simplest serving mode that meets the product need.
- Separate gateway concerns from model runtime concerns.
- Stream responses for chat-like experiences when possible.
- Add caching, batching, and routing only after measuring real bottlenecks.
- Treat overload behavior as part of product design.
◆ Quick Reference
| Problem | First Serving Move |
| --- | --- |
| High API cost | Evaluate self-hosting or smaller-model routing |
| Slow first token | Reduce prompt size, enable streaming, inspect the prefill path |
| GPU memory pressure | Quantize, reduce batch size, inspect KV-cache growth |
| Uneven traffic | Add queueing, autoscaling, and backpressure (see the sketch below) |
| Mixed workloads | Split interactive and batch paths |
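Backpressure can start very small: a bounded in-flight limit that sheds excess load instead of queueing forever. A minimal sketch; the endpoint shape and the concurrency limit of 8 are arbitrary examples to be tuned against measured capacity.

```python
# Minimal backpressure sketch: cap in-flight requests and shed the rest with 503.
# The concurrency limit and endpoint shape are arbitrary examples.
import asyncio

from fastapi import FastAPI, HTTPException

app = FastAPI()
MAX_IN_FLIGHT = 8
_slots = asyncio.Semaphore(MAX_IN_FLIGHT)

@app.post("/generate")
async def generate(payload: dict):
    if _slots.locked():  # every slot taken: reject now rather than queue unboundedly
        raise HTTPException(status_code=503, detail="server at capacity, retry later")
    async with _slots:
        # ... forward the request to the serving engine here ...
        await asyncio.sleep(0.1)  # placeholder for the actual inference call
        return {"status": "ok"}
```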
○ Gotchas & Common Mistakes
- Teams often blame the model when the real bottleneck is gateway design or retrieval latency.
- A single serving stack for every workload usually performs badly.
- OpenAI-compatible APIs simplify clients but do not remove serving complexity.
- P50 latency can look healthy while P95 latency is unacceptable.
○ Interview Angles
Q: What is the difference between inference optimization and model serving?
A: Inference optimization focuses on making the core generation path more efficient, for example quantization or KV-cache improvements. Model serving covers the full production runtime around that path, including APIs, routing, scheduling, scaling, and failure handling.

Q: When would you self-host instead of using a managed API?
A: When privacy, volume economics, model customization, or latency control outweigh the extra operational burden. Otherwise, managed APIs are usually the faster path.
★ Code & Implementation
vLLM Server Setup (OpenAI-Compatible)
```bash
# pip install vllm>=0.8
# ⚠️ Last tested: 2026-04 | Requires: CUDA GPU, vllm>=0.8
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --port 8000
```
```python
# Query the vLLM server - identical to the OpenAI API
# pip install openai>=1.60
# ⚠️ Last tested: 2026-04
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Throughput test: 30 sequential requests
prompts = ["What is RAG?", "Explain LoRA.", "What is MoE?"] * 10

start = time.monotonic()
for p in prompts:
    client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": p}],
        max_tokens=50,
    )
elapsed = time.monotonic() - start
print(f"{len(prompts)} requests in {elapsed:.1f}s = {len(prompts) / elapsed:.1f} req/s")
```
Health Check and Load Test Endpoint
```python
# pip install fastapi>=0.110 uvicorn>=0.29 httpx>=0.27
# ⚠️ Last tested: 2026-04 | Requires: fastapi>=0.110, httpx>=0.27
import asyncio
import statistics
import time

import httpx
from fastapi import FastAPI, Request
from pydantic import BaseModel

app = FastAPI()

class HealthStatus(BaseModel):
    status: str
    model_loaded: bool
    avg_latency_ms: float
    requests_served: int

_request_count = 0
_latencies: list[float] = []

@app.middleware("http")
async def record_latency(request: Request, call_next):
    """Populate the counters that the health check reads."""
    global _request_count
    t0 = time.monotonic()
    response = await call_next(request)
    _latencies.append((time.monotonic() - t0) * 1000)
    _request_count += 1
    return response

@app.get("/health", response_model=HealthStatus)
async def health_check():
    """Production health check for an LLM serving endpoint."""
    return HealthStatus(
        status="healthy" if _latencies and statistics.mean(_latencies) < 5000 else "degraded",
        model_loaded=True,
        avg_latency_ms=round(statistics.mean(_latencies), 1) if _latencies else 0,
        requests_served=_request_count,
    )

async def load_test(base_url: str, num_requests: int = 20, concurrency: int = 5) -> dict:
    """Simple load test for a model serving endpoint."""

    async def single_request(client: httpx.AsyncClient, i: int) -> float:
        payload = {
            "model": "default",
            "messages": [{"role": "user", "content": f"Test {i}"}],
            "max_tokens": 50,
        }
        t0 = time.monotonic()
        try:
            resp = await client.post(f"{base_url}/v1/chat/completions", json=payload, timeout=30)
        except httpx.HTTPError:
            return -1  # count timeouts and connection errors as failures
        elapsed = (time.monotonic() - t0) * 1000
        return elapsed if resp.status_code == 200 else -1

    async with httpx.AsyncClient() as client:
        sem = asyncio.Semaphore(concurrency)

        async def bounded(i):
            async with sem:
                return await single_request(client, i)

        results = await asyncio.gather(*[bounded(i) for i in range(num_requests)])

    ok = sorted(r for r in results if r > 0)
    errors = sum(1 for r in results if r < 0)
    n = len(ok)
    return {
        "total": num_requests,
        "errors": errors,
        "p50_ms": round(ok[n // 2], 1) if n else 0,
        "p95_ms": round(ok[int(n * 0.95)], 1) if n else 0,
        "p99_ms": round(ok[int(n * 0.99)], 1) if n else 0,
        # Rough throughput estimate: total request time divided by concurrency
        "throughput_rps": round(num_requests / (sum(ok) / 1000 / concurrency), 1) if ok else 0,
    }

# result = asyncio.run(load_test("http://localhost:8000", num_requests=50, concurrency=10))
# print(f"P50: {result['p50_ms']}ms | P95: {result['p95_ms']}ms | Errors: {result['errors']}")
# Expected output: P50: ~120ms | P95: ~450ms | Errors: 0
```
◆ Production Failure Modes
| Failure | Symptoms | Root Cause | Mitigation |
| --- | --- | --- | --- |
| Cold start latency | First request takes 30-60 seconds | Model loading from disk to GPU on demand | Model preloading, warm standby replicas |
| GPU memory fragmentation | OOM despite sufficient total VRAM | Non-contiguous allocation from dynamic batching | vLLM paged attention, periodic defragmentation |
| Batch queue starvation | High-priority requests delayed by large batches | FIFO batching without priorities | Priority queues, preemptive scheduling |
| Model version rollback failure | Can't revert after a bad deployment | No versioned model registry | Model registry (MLflow), blue-green deployments |
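Cold starts are often mitigated by firing a warmup request when a replica boots, before it is marked ready for traffic. A sketch; the endpoint, model name, and timeout are assumptions.

```python
# Warmup sketch: run one tiny generation at startup so the first real user never
# pays the model-load / first-kernel cost. Endpoint, model, and timeout are assumptions.
import httpx

def warm_up(base_url: str = "http://localhost:8000",
            model: str = "meta-llama/Llama-3.1-8B-Instruct") -> bool:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1,
    }
    try:
        resp = httpx.post(f"{base_url}/v1/chat/completions", json=payload, timeout=120)
        return resp.status_code == 200  # only mark the replica ready on success
    except httpx.HTTPError:
        return False

# Call warm_up() from the container startup hook or readiness probe before serving traffic.
```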
◆ Hands-On Exercises
Exercise 1: Deploy and Benchmark a Model Server
Goal: Deploy a model with vLLM and benchmark it at different concurrency levels
Time: 45 minutes
Steps:
1. Deploy a small model with vLLM or TGI
2. Benchmark latency at 1, 10, and 50 concurrent requests
3. Implement a health check endpoint that verifies GPU readiness
4. Measure cold start time vs warm request time
Expected Output: Latency vs concurrency table, health check implementation
★ Recommended Resources
★ Sources
- vLLM documentation: https://docs.vllm.ai
- Hugging Face Text Generation Inference documentation
- NVIDIA Triton Inference Server documentation
★ Connections

- Inference Optimization