Skip to content

LLMOps & Production Deployment

Bit: Anyone can call an API in a notebook. Getting that same API to serve 10,000 users reliably, cheaply, safely, and without hallucinating financial advice? That's LLMOps. It's the difference between a demo and a product.


★ TL;DR

  • What: The practices, tools, and pipelines for deploying, monitoring, and maintaining LLM applications in production
  • Why: 90% of GenAI projects fail to reach production. LLMOps is what separates "cool prototype" from "reliable product"
  • Key point: LLMs are non-deterministic, expensive, and can hallucinate. You need different operational practices than traditional software.

★ Overview

Definition

LLMOps extends MLOps and DevOps to address the unique challenges of LLM applications: non-deterministic outputs, prompt management, token cost tracking, hallucination monitoring, and safety guardrails — all while maintaining the reliability users expect.

Scope

Covers the production lifecycle. For deployment packaging, see Docker & Kubernetes for GenAI Deployment. For runtime design, see Model Serving for LLM Applications. For tracing and ops telemetry, see Monitoring & Observability for GenAI Systems. For release automation, see CI/CD for ML and LLM Systems. For economics, see Cost Optimization for GenAI Systems. For lower-level optimization, see Inference Optimization.

Significance

  • This is what "senior GenAI engineer" actually means in job descriptions
  • Companies are investing more in LLMOps than in model training
  • Understanding this = you can build AND ship, not just experiment

★ Deep Dive

The LLMOps Stack

┌─────────────────────────────────────────────────────────┐
│                    USER REQUESTS                         │
├─────────────────────────────────────────────────────────┤
│  GATEWAY LAYER                                          │
│  Rate limiting │ Auth │ Request routing │ Load balancing │
├─────────────────────────────────────────────────────────┤
│  GUARDRAILS (Input)                                     │
│  Prompt injection detection │ PII scrubbing │ Validation │
├─────────────────────────────────────────────────────────┤
│  APPLICATION LOGIC                                      │
│  RAG pipeline │ Agent loops │ Chain orchestration        │
├────────────────────┬────────────────────────────────────┤
│  LLM LAYER         │  CACHE LAYER                       │
│  API calls │       │  Semantic cache │ Exact cache       │
│  Self-hosted       │  (save $$$ on repeated queries)    │
├────────────────────┴────────────────────────────────────┤
│  GUARDRAILS (Output)                                    │
│  Hallucination check │ Toxicity filter │ PII detection  │
├─────────────────────────────────────────────────────────┤
│  OBSERVABILITY                                          │
│  Tracing │ Logging │ Metrics │ Cost tracking │ Alerts   │
├─────────────────────────────────────────────────────────┤
│  EVALUATION (Continuous)                                │
│  Automated evals │ Human feedback │ Regression testing  │
└─────────────────────────────────────────────────────────┘

Prompt Management

THE PROBLEM:
  Prompts are like code — they need version control.
  But they're stored in strings, not files.
  One bad prompt change can break everything.

SOLUTION: Treat prompts as first-class artifacts

  ┌──────────────────────────────────────────────┐
  │  PROMPT LIFECYCLE                            │
  │                                              │
  │  1. Write prompt (with template variables)   │
  │  2. Test against eval suite (golden examples)│
  │  3. Version it (v1.0, v1.1, v2.0)           │
  │  4. A/B test in production (v1 vs v2)        │
  │  5. Monitor quality metrics                  │
  │  6. Rollback if quality drops                │
  └──────────────────────────────────────────────┘

TOOLS:
  - LangSmith (LangChain) — prompt playground + versioning
  - Braintrust — prompt testing + A/B testing
  - Promptfoo — CLI-based prompt testing
  - Portkey — AI gateway with prompt management

Monitoring & Observability

What to Monitor Why Tool
Latency (TTFT, total) User experience Langfuse, LangSmith
Token usage Cost control Portkey, custom logging
Error rates Reliability Any APM + custom
Quality scores Hallucination, relevance RAGAS, DeepEval
Cost per query Budget management Portkey, custom
Guardrail triggers Safety monitoring NeMo, Lakera
User feedback Ground truth Custom (👍/👎 buttons)
Drift Performance degradation over time Arize Phoenix
# ⚠️ Last tested: 2026-04
# ═══ Basic LLM Observability with Langfuse ═══
from langfuse.openai import openai  # Drop-in replacement

# Every call is now automatically traced
response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain RAG"}],
    metadata={"user_id": "user_123", "session": "abc"}
)
# → Langfuse dashboard shows: latency, tokens, cost, trace

# ═══ Semantic Caching (reduce costs 30-60%) ═══
# If a similar question was asked before, return cached answer
from gptcache import cache
cache.init()  # Initialize semantic cache
# Similar questions → cache hit → save tokens + latency

Deployment Patterns

Pattern When Pros Cons
API-only (OpenAI, Anthropic) Fast start, simple Easy, no infra Cost at scale, vendor lock
API + Gateway (Portkey, LiteLLM) Multi-model, production Fallbacks, load balancing Extra layer
Self-hosted (vLLM + open model) Data privacy, cost control Full control, no vendor GPU infra needed
Hybrid (API for hard, self-host for easy) Cost optimization Best of both Complex routing
DEPLOYMENT CHECKLIST:
  □ Rate limiting (protect against abuse)
  □ API key management (rotate, scope)
  □ Retry logic with exponential backoff
  □ Fallback models (if primary fails)
  □ Cost alerts (daily/monthly budgets)
  □ Response logging (for debugging + eval)
  □ User feedback collection (👍/👎)
  □ Automated eval suite (run on every change)
  □ Guardrails (input + output)
  □ Health checks and uptime monitoring

CI/CD for LLM Applications

TRADITIONAL CI/CD:
  Code change → Run tests → Deploy if tests pass

LLM CI/CD (additional steps):
  Prompt change → Run eval suite → Compare with baseline
                → If better → Deploy (canary)
                → If worse → Block deployment

  ┌───────────────────────────────────────────────────┐
  │  PROMPT/CODE CHANGE                               │
  │       │                                           │
  │       ▼                                           │
  │  ┌─────────────┐                                  │
  │  │ Unit Tests   │ ← Traditional code tests        │
  │  └──────┬──────┘                                  │
  │         ▼                                         │
  │  ┌─────────────┐                                  │
  │  │ LLM Eval     │ ← Run prompt against golden set │
  │  │ Suite        │   Compare quality scores         │
  │  └──────┬──────┘                                  │
  │         ▼                                         │
  │  ┌─────────────┐                                  │
  │  │ Regression   │ ← Did any existing answers get  │
  │  │ Check        │   worse? (output diff analysis)  │
  │  └──────┬──────┘                                  │
  │         ▼                                         │
  │  ┌─────────────┐                                  │
  │  │ Cost Check   │ ← Is the new prompt more         │
  │  │              │   expensive? Within budget?       │
  │  └──────┬──────┘                                  │
  │         ▼                                         │
  │  DEPLOY (canary → 5% → 50% → 100%)               │
  └───────────────────────────────────────────────────┘

Observability Platforms (2026)

Platform Type Best For
LangSmith SaaS LangChain users, full lifecycle
Langfuse Open-source Self-hosted, privacy-first
Arize Phoenix Open-source Drift monitoring, traces
Portkey AI Gateway Multi-model routing, cost tracking
Braintrust SaaS Eval + prompt management
Maxim AI SaaS Enterprise observability
Helicone SaaS Simple logging + analytics

◆ Quick Reference

COST REDUCTION STRATEGIES:
  1. Semantic caching (30-60% savings)
  2. Smaller models for simple tasks (GPT-5.4-mini/nano vs GPT-5.4)
  3. Prompt optimization (fewer tokens)
  4. Batching requests (where possible)
  5. Self-host for high-volume (break-even ~$5K/month)

INCIDENT RESPONSE:
  Model returns gibberish  → Check API status, switch to fallback
  Costs spike unexpectedly → Check for prompt injection, rate limit
  Quality drops suddenly   → API model updated? Check eval scores
  Guardrail trigger surge  → Possible attack, review logs

KEY METRICS:
  TTFT (time to first token) < 500ms for interactive
  Total latency < 5s for most queries
  Error rate < 0.1%
  Cost per query: track and budget
  Eval score regression: < 5% acceptable

○ Gotchas & Common Mistakes

  • ⚠️ No eval suite = deploying blind: You MUST have a set of golden test cases to catch regressions.
  • ⚠️ LLM APIs change without warning: OpenAI/Anthropic update models silently. Your app can break overnight. Monitor quality continuously.
  • ⚠️ Logging everything is expensive: Log smartly — sample in production, log fully in staging.
  • ⚠️ Prompt injection is a real attack: Users WILL try to override your system prompt. Always validate.
  • ⚠️ Vendor lock-in is real: Abstract your LLM calls behind an interface. Use gateways like Portkey/LiteLLM.

○ Interview Angles

  • Q: How would you take an LLM prototype to production?
  • A: (1) Create an eval suite (50+ golden examples), (2) Add input/output guardrails, (3) Implement observability (Langfuse/LangSmith), (4) Set up cost alerting, (5) Abstract the LLM provider behind a gateway for fallbacks, (6) CI/CD pipeline that runs eval suite on every prompt/code change, (7) Canary deployment with quality monitoring.

  • Q: How do you handle LLM quality degradation in production?

  • A: Continuous monitoring via automated evals, user feedback (👍/👎), drift detection. When quality drops: check if the provider updated the model, run regression analysis against golden set, roll back prompts if needed, or switch to a backup model.

★ Code & Implementation

LLM Call Tracker (Latency + Token Logging)

# pip install openai>=1.60
# ⚠️ Last tested: 2026-04 | Requires: openai>=1.60, OPENAI_API_KEY env var
import time, uuid, logging
from openai import OpenAI

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(name)s %(levelname)s %(message)s")
log = logging.getLogger("llmops")
client = OpenAI()

def tracked_completion(messages: list[dict], model: str = "gpt-4o-mini", **kw) -> str:
    """Production-instrumented LLM call with trace/latency/token logging."""
    trace_id = str(uuid.uuid4())[:8]
    start = time.monotonic()
    resp = client.chat.completions.create(model=model, messages=messages, **kw)
    latency_ms = (time.monotonic() - start) * 1000
    u = resp.usage
    log.info(
        "llm | trace=%s model=%s prompt_tok=%d completion_tok=%d total_tok=%d latency_ms=%.1f",
        trace_id, model, u.prompt_tokens, u.completion_tokens, u.total_tokens, latency_ms,
    )
    return resp.choices[0].message.content

result = tracked_completion(
    [{"role": "user", "content": "Summarize RAG in one sentence."}],
    max_tokens=80, temperature=0.3,
)
print(result)

A/B Prompt Experiment

# ⚠️ Last tested: 2026-04 | Requires: openai>=1.60, OPENAI_API_KEY
import statistics

def ab_eval(prompt_a: str, prompt_b: str, test_inputs: list[str]) -> None:
    for label, system in [("A", prompt_a), ("B", prompt_b)]:
        latencies, tokens = [], []
        for user_input in test_inputs:
            start = time.monotonic()
            resp = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "system", "content": system},
                          {"role": "user",   "content": user_input}],
                max_tokens=150, temperature=0,
            )
            latencies.append((time.monotonic() - start) * 1000)
            tokens.append(resp.usage.total_tokens)
        print(f"Variant {label}: median_ms={statistics.median(latencies):.0f} "
              f"avg_tokens={statistics.mean(tokens):.1f}")

ab_eval(
    "You are a concise assistant. Answer in one sentence.",
    "You are a helpful assistant.",
    ["What is RAG?", "What is LoRA?", "What is MoE?"],
)

★ Connections

Relationship Topics
Builds on Llms Overview, Evaluation And Benchmarks, Ethics Safety Alignment
Leads to Enterprise AI deployment, Scalable AI systems
Compare with Traditional MLOps (ML models), DevOps (software)
Cross-domain Site Reliability Engineering, Platform engineering

◆ Production Failure Modes

Failure Symptoms Root Cause Mitigation
Prompt-model version skew Prompt templates break after model update No versioning of prompt-model pairs Prompt registry with model version pinning, integration tests
Shadow production drift Staging doesn't predict production quality Staging data differs from production Production traffic shadowing, online eval with guardrails
Eval regression undetected Shipped regression on long-tail queries Eval suite too small Growing eval set from production failures, stratified eval
Secret sprawl API keys hardcoded in configs No secrets management Vault/secrets manager, environment variables, key rotation

◆ Hands-On Exercises

Exercise 1: Build a Prompt Versioning System

Goal: Set up prompt versioning with A/B test capability Time: 30 minutes Steps: 1. Create a prompt registry (JSON/YAML config or database) 2. Implement version pinning (prompt v1 + model gpt-4o = deployment A) 3. Add A/B routing (50/50 split between prompt v1 and v2) 4. Log quality scores per variant Expected Output: A/B test results showing prompt version comparison


Type Resource Why
📘 Book "AI Engineering" by Chip Huyen (2025), Ch 8-9 Definitive treatment of LLMOps patterns
📘 Book "Designing Machine Learning Systems" by Chip Huyen (2022) MLOps foundations that LLMOps builds on
🔧 Hands-on LangSmith Documentation Production LLM observability and evaluation platform
🎥 Video Chip Huyen — "Building LLM Applications for Production" Practical LLMOps talk covering common pitfalls

★ Sources

  • LangSmith documentation — https://docs.smith.langchain.com
  • Langfuse documentation — https://langfuse.com/docs
  • Portkey AI Gateway — https://portkey.ai/docs
  • Arize Phoenix — https://docs.arize.com/phoenix
  • Hamel Husain, "Your AI Product Needs Evals" (2024)