LLMOps & Production Deployment¶
✨ Bit: Anyone can call an API in a notebook. Getting that same API to serve 10,000 users reliably, cheaply, safely, and without hallucinating financial advice? That's LLMOps. It's the difference between a demo and a product.
★ TL;DR¶
- What: The practices, tools, and pipelines for deploying, monitoring, and maintaining LLM applications in production
- Why: 90% of GenAI projects fail to reach production. LLMOps is what separates "cool prototype" from "reliable product"
- Key point: LLMs are non-deterministic, expensive, and can hallucinate. You need different operational practices than traditional software.
★ Overview¶
Definition¶
LLMOps extends MLOps and DevOps to address the unique challenges of LLM applications: non-deterministic outputs, prompt management, token cost tracking, hallucination monitoring, and safety guardrails — all while maintaining the reliability users expect.
Scope¶
Covers the production lifecycle. For deployment packaging, see Docker & Kubernetes for GenAI Deployment. For runtime design, see Model Serving for LLM Applications. For tracing and ops telemetry, see Monitoring & Observability for GenAI Systems. For release automation, see CI/CD for ML and LLM Systems. For economics, see Cost Optimization for GenAI Systems. For lower-level optimization, see Inference Optimization.
Significance¶
- This is what "senior GenAI engineer" actually means in job descriptions
- Companies are investing more in LLMOps than in model training
- Understanding this = you can build AND ship, not just experiment
★ Deep Dive¶
The LLMOps Stack¶
┌─────────────────────────────────────────────────────────┐
│ USER REQUESTS │
├─────────────────────────────────────────────────────────┤
│ GATEWAY LAYER │
│ Rate limiting │ Auth │ Request routing │ Load balancing │
├─────────────────────────────────────────────────────────┤
│ GUARDRAILS (Input) │
│ Prompt injection detection │ PII scrubbing │ Validation │
├─────────────────────────────────────────────────────────┤
│ APPLICATION LOGIC │
│ RAG pipeline │ Agent loops │ Chain orchestration │
├────────────────────┬────────────────────────────────────┤
│ LLM LAYER │ CACHE LAYER │
│ API calls │ │ Semantic cache │ Exact cache │
│ Self-hosted │ (save $$$ on repeated queries) │
├────────────────────┴────────────────────────────────────┤
│ GUARDRAILS (Output) │
│ Hallucination check │ Toxicity filter │ PII detection │
├─────────────────────────────────────────────────────────┤
│ OBSERVABILITY │
│ Tracing │ Logging │ Metrics │ Cost tracking │ Alerts │
├─────────────────────────────────────────────────────────┤
│ EVALUATION (Continuous) │
│ Automated evals │ Human feedback │ Regression testing │
└─────────────────────────────────────────────────────────┘
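The layers above compose naturally in code. Below is a minimal sketch of one request flowing through them — the guardrail checks, cache, and logger are deliberately naive placeholders (stand-ins for tools like Lakera, GPTCache, or Langfuse), not any specific library's API:
# ⚠️ Illustrative sketch — toy guardrails + exact-match cache, not a production framework
import hashlib, logging
log = logging.getLogger("pipeline")
_cache: dict[str, str] = {}  # cache layer (exact-match, keyed by prompt hash)
BLOCKED = ("ignore previous instructions", "reveal your system prompt")  # toy input guardrail
def handle_request(user_prompt: str, call_llm) -> str:
    """call_llm: any function str -> str (API client, self-hosted model, ...)."""
    # 1. Input guardrail: reject obvious prompt-injection attempts
    if any(p in user_prompt.lower() for p in BLOCKED):
        return "Request blocked by input guardrail."
    # 2. Cache layer: reuse the answer for an identical prompt
    key = hashlib.sha256(user_prompt.encode()).hexdigest()
    if key in _cache:
        log.info("cache_hit key=%s", key[:8])
        return _cache[key]
    # 3. LLM layer
    answer = call_llm(user_prompt)
    # 4. Output guardrail: toy banned-phrase filter before returning to the user
    if "confidential" in answer.lower():
        answer = "[response withheld by output guardrail]"
    # 5. Observability: log the call, then cache the answer
    log.info("llm_call key=%s answer_len=%d", key[:8], len(answer))
    _cache[key] = answer
    return answer
# Usage with a stub model:
print(handle_request("Explain RAG in one sentence.", lambda p: f"(model answer to: {p})"))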
Prompt Management¶
THE PROBLEM:
Prompts are like code — they need version control.
But they often live as strings scattered through application code, not as tracked, reviewable files.
One bad prompt change can break everything.
SOLUTION: Treat prompts as first-class artifacts
┌──────────────────────────────────────────────┐
│ PROMPT LIFECYCLE │
│ │
│ 1. Write prompt (with template variables) │
│ 2. Test against eval suite (golden examples)│
│ 3. Version it (v1.0, v1.1, v2.0) │
│ 4. A/B test in production (v1 vs v2) │
│ 5. Monitor quality metrics │
│ 6. Rollback if quality drops │
└──────────────────────────────────────────────┘
TOOLS:
- LangSmith (LangChain) — prompt playground + versioning
- Braintrust — prompt testing + A/B testing
- Promptfoo — CLI-based prompt testing
- Portkey — AI gateway with prompt management
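The tools above add UIs, eval suites, and A/B testing on top of this idea, but the core is small enough to sketch by hand. A minimal registry, assuming an in-code dict (swap in YAML or a database as needed); note that each version also pins the model it was evaluated against, which matters for the version-skew failure mode listed later:
# ⚠️ Illustrative sketch — versioned prompt templates pinned to a model, not a specific tool's API
from dataclasses import dataclass
@dataclass(frozen=True)
class PromptVersion:
    version: str
    model: str       # pin the model this prompt was evaluated against
    template: str    # with {placeholders} for variables
REGISTRY: dict[str, dict[str, PromptVersion]] = {
    "summarize": {
        "v1.0": PromptVersion("v1.0", "gpt-4o-mini",
                              "Summarize the following text in one sentence:\n{text}"),
        "v1.1": PromptVersion("v1.1", "gpt-4o-mini",
                              "Summarize the following text in one sentence for a non-expert reader:\n{text}"),
    }
}
def render(prompt_name: str, version: str, **vars) -> tuple[str, str]:
    """Return (model, rendered_prompt) so callers can't silently mix prompt and model versions."""
    pv = REGISTRY[prompt_name][version]
    return pv.model, pv.template.format(**vars)
model, prompt = render("summarize", "v1.1", text="LLMOps extends MLOps to LLM applications.")
print(model, "|", prompt)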
Monitoring & Observability¶
| What to Monitor | Why | Tool |
|---|---|---|
| Latency (TTFT, total) | User experience | Langfuse, LangSmith |
| Token usage | Cost control | Portkey, custom logging |
| Error rates | Reliability | Any APM + custom |
| Quality scores | Hallucination, relevance | RAGAS, DeepEval |
| Cost per query | Budget management | Portkey, custom |
| Guardrail triggers | Safety monitoring | NeMo, Lakera |
| User feedback | Ground truth | Custom (👍/👎 buttons) |
| Drift | Performance degradation over time | Arize Phoenix |
# ⚠️ Last tested: 2026-04
# ═══ Basic LLM Observability with Langfuse ═══
from langfuse.openai import openai # Drop-in replacement
# Every call is now automatically traced
response = openai.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Explain RAG"}],
metadata={"user_id": "user_123", "session": "abc"}
)
# → Langfuse dashboard shows: latency, tokens, cost, trace
# ═══ Semantic Caching (reduce costs 30-60%) ═══
# If a similar question was asked before, return the cached answer
from gptcache import cache
cache.init()  # Note: with no arguments this is exact-match caching;
              # configure an embedding function + similarity evaluation for true semantic caching
# Route calls through gptcache.adapter.openai (instead of the plain client) so they hit the cache
# Similar questions → cache hit → save tokens + latency
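To make the "Cost per query" row concrete, a quick back-of-the-envelope sketch. The per-million-token prices below are placeholders for illustration — always check the provider's current price sheet:
# ⚠️ Illustrative sketch — prices are placeholders, not current list prices
PRICE_PER_M_TOKENS = {            # (input, output) USD per 1M tokens
    "gpt-4o":      (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}
def query_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    p_in, p_out = PRICE_PER_M_TOKENS[model]
    return prompt_tokens / 1e6 * p_in + completion_tokens / 1e6 * p_out
# e.g. a typical RAG query: large prompt (retrieved context), short answer
print(f"${query_cost('gpt-4o', 3_000, 300):.4f} per query")       # ≈ $0.0105
print(f"${query_cost('gpt-4o-mini', 3_000, 300):.4f} per query")  # ≈ $0.0006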
Deployment Patterns¶
| Pattern | When | Pros | Cons |
|---|---|---|---|
| API-only (OpenAI, Anthropic) | Fast start, simple | Easy, no infra | Cost at scale, vendor lock |
| API + Gateway (Portkey, LiteLLM) | Multi-model, production | Fallbacks, load balancing | Extra layer |
| Self-hosted (vLLM + open model) | Data privacy, cost control | Full control, no vendor | GPU infra needed |
| Hybrid (API for hard, self-host for easy) | Cost optimization | Best of both | Complex routing |
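The Hybrid row is the trickiest because of the routing decision. A deliberately naive sketch of such a router — a keyword/length heuristic here; in practice teams often use a small classifier or a dedicated router model:
# ⚠️ Illustrative sketch — naive difficulty heuristic for hybrid routing
CHEAP_MODEL, STRONG_MODEL = "gpt-4o-mini", "gpt-4o"
def pick_model(user_prompt: str) -> str:
    hard_signals = ("step by step", "analyze", "compare", "legal", "medical")
    is_long = len(user_prompt.split()) > 200
    if is_long or any(s in user_prompt.lower() for s in hard_signals):
        return STRONG_MODEL
    return CHEAP_MODEL
print(pick_model("What is RAG?"))                                      # → gpt-4o-mini
print(pick_model("Analyze these contracts and compare liability."))   # → gpt-4o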
DEPLOYMENT CHECKLIST:
□ Rate limiting (protect against abuse)
□ API key management (rotate, scope)
□ Retry logic with exponential backoff (see the sketch after this checklist)
□ Fallback models (if primary fails)
□ Cost alerts (daily/monthly budgets)
□ Response logging (for debugging + eval)
□ User feedback collection (👍/👎)
□ Automated eval suite (run on every change)
□ Guardrails (input + output)
□ Health checks and uptime monitoring
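A sketch of the retry + fallback items from the checklist, assuming the openai>=1.x SDK's exception classes; the model names in the fallback chain are illustrative:
# ⚠️ Illustrative sketch — retry with exponential backoff, then fall back to the next model
import time
from openai import OpenAI, APIError, APITimeoutError, RateLimitError
client = OpenAI()
FALLBACK_CHAIN = ["gpt-4o", "gpt-4o-mini"]  # primary first, cheaper/backup models after
def robust_completion(messages: list[dict], max_retries: int = 3) -> str:
    for model in FALLBACK_CHAIN:
        for attempt in range(max_retries):
            try:
                resp = client.chat.completions.create(model=model, messages=messages, timeout=30)
                return resp.choices[0].message.content
            except (RateLimitError, APITimeoutError, APIError):
                time.sleep(2 ** attempt)  # 1s, 2s, 4s ... before retrying this model
        # all retries for this model exhausted → try the next model in the chain
    raise RuntimeError("All models in the fallback chain failed")
print(robust_completion([{"role": "user", "content": "One-line definition of LLMOps?"}]))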
CI/CD for LLM Applications¶
TRADITIONAL CI/CD:
Code change → Run tests → Deploy if tests pass
LLM CI/CD (additional steps):
Prompt change → Run eval suite → Compare with baseline
→ If equal or better → Deploy (canary)
→ If worse → Block deployment
┌───────────────────────────────────────────────────┐
│ PROMPT/CODE CHANGE │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Unit Tests │ ← Traditional code tests │
│ └──────┬──────┘ │
│ ▼ │
│ ┌─────────────┐ │
│ │ LLM Eval │ ← Run prompt against golden set │
│ │ Suite │ Compare quality scores │
│ └──────┬──────┘ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Regression │ ← Did any existing answers get │
│ │ Check │ worse? (output diff analysis) │
│ └──────┬──────┘ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Cost Check │ ← Is the new prompt more │
│ │ │ expensive? Within budget? │
│ └──────┬──────┘ │
│ ▼ │
│ DEPLOY (canary → 5% → 50% → 100%) │
└───────────────────────────────────────────────────┘
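A sketch of the "LLM Eval Suite" + regression gate as a script a CI job could run. The scorer here is a token-overlap placeholder (in practice, plug in RAGAS, DeepEval, or an LLM-as-judge), the golden-set path and generate function are hypothetical, and the 5% threshold matches the regression budget in the Quick Reference below:
# ⚠️ Illustrative sketch — CI eval gate that blocks deployment on quality regression
import json
import sys
REGRESSION_BUDGET = 0.05  # block if mean score drops more than 5% vs. baseline
def score_answer(answer: str, reference: str) -> float:
    """Placeholder scorer: token overlap with the golden reference (0..1)."""
    a, r = set(answer.lower().split()), set(reference.lower().split())
    return len(a & r) / max(len(r), 1)
def run_gate(golden_path: str, baseline_score: float, generate) -> None:
    with open(golden_path) as f:
        golden = json.load(f)  # [{"input": ..., "reference": ...}, ...]
    scores = [score_answer(generate(ex["input"]), ex["reference"]) for ex in golden]
    mean = sum(scores) / len(scores)
    print(f"candidate={mean:.3f} baseline={baseline_score:.3f}")
    if mean < baseline_score * (1 - REGRESSION_BUDGET):
        sys.exit(1)  # non-zero exit fails the CI job → deployment blocked
# Hypothetical usage: run_gate("golden_set.json", baseline_score=0.82, generate=my_new_prompt_pipeline)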
Observability Platforms (2026)¶
| Platform | Type | Best For |
|---|---|---|
| LangSmith | SaaS | LangChain users, full lifecycle |
| Langfuse | Open-source | Self-hosted, privacy-first |
| Arize Phoenix | Open-source | Drift monitoring, traces |
| Portkey | AI Gateway | Multi-model routing, cost tracking |
| Braintrust | SaaS | Eval + prompt management |
| Maxim AI | SaaS | Enterprise observability |
| Helicone | SaaS | Simple logging + analytics |
◆ Quick Reference¶
COST REDUCTION STRATEGIES:
1. Semantic caching (30-60% savings)
2. Smaller models for simple tasks (GPT-5.4-mini/nano vs GPT-5.4)
3. Prompt optimization (fewer tokens)
4. Batching requests (where possible)
5. Self-host for high-volume (break-even ~$5K/month)
INCIDENT RESPONSE:
Model returns gibberish → Check API status, switch to fallback
Costs spike unexpectedly → Check for prompt injection, rate limit
Quality drops suddenly → API model updated? Check eval scores
Guardrail trigger surge → Possible attack, review logs
KEY METRICS:
TTFT (time to first token) < 500ms for interactive
Total latency < 5s for most queries
Error rate < 0.1%
Cost per query: track and budget
Eval score regression: < 5% acceptable
○ Gotchas & Common Mistakes¶
- ⚠️ No eval suite = deploying blind: You MUST have a set of golden test cases to catch regressions.
- ⚠️ LLM APIs change without warning: OpenAI/Anthropic update models silently. Your app can break overnight. Monitor quality continuously.
- ⚠️ Logging everything is expensive: Log smartly — sample in production, log fully in staging.
- ⚠️ Prompt injection is a real attack: Users WILL try to override your system prompt. Always validate.
- ⚠️ Vendor lock-in is real: Abstract your LLM calls behind an interface. Use gateways like Portkey/LiteLLM.
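On that last point, the abstraction can be as small as one interface. A minimal sketch (names are illustrative, not a specific library's API):
# ⚠️ Illustrative sketch — application code depends on this interface, not on a vendor SDK
from typing import Protocol
class ChatModel(Protocol):
    def complete(self, messages: list[dict]) -> str: ...
class OpenAIChat:
    def __init__(self, model: str = "gpt-4o-mini"):
        from openai import OpenAI
        self._client, self._model = OpenAI(), model
    def complete(self, messages: list[dict]) -> str:
        resp = self._client.chat.completions.create(model=self._model, messages=messages)
        return resp.choices[0].message.content
def answer(llm: ChatModel, question: str) -> str:
    # Application logic only sees ChatModel — swapping vendors means adding a new adapter class
    return llm.complete([{"role": "user", "content": question}])
print(answer(OpenAIChat(), "What does LLMOps add on top of MLOps?"))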
○ Interview Angles¶
- Q: How would you take an LLM prototype to production?
- A: (1) Create an eval suite (50+ golden examples), (2) Add input/output guardrails, (3) Implement observability (Langfuse/LangSmith), (4) Set up cost alerting, (5) Abstract the LLM provider behind a gateway for fallbacks, (6) CI/CD pipeline that runs the eval suite on every prompt/code change, (7) Canary deployment with quality monitoring.
- Q: How do you handle LLM quality degradation in production?
- A: Continuous monitoring via automated evals, user feedback (👍/👎), and drift detection. When quality drops: check if the provider updated the model, run regression analysis against the golden set, roll back prompts if needed, or switch to a backup model.
★ Code & Implementation¶
LLM Call Tracker (Latency + Token Logging)¶
# pip install openai>=1.60
# ⚠️ Last tested: 2026-04 | Requires: openai>=1.60, OPENAI_API_KEY env var
import time, uuid, logging
from openai import OpenAI

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(name)s %(levelname)s %(message)s")
log = logging.getLogger("llmops")
client = OpenAI()

def tracked_completion(messages: list[dict], model: str = "gpt-4o-mini", **kw) -> str:
    """Production-instrumented LLM call with trace/latency/token logging."""
    trace_id = str(uuid.uuid4())[:8]
    start = time.monotonic()
    resp = client.chat.completions.create(model=model, messages=messages, **kw)
    latency_ms = (time.monotonic() - start) * 1000
    u = resp.usage
    log.info(
        "llm | trace=%s model=%s prompt_tok=%d completion_tok=%d total_tok=%d latency_ms=%.1f",
        trace_id, model, u.prompt_tokens, u.completion_tokens, u.total_tokens, latency_ms,
    )
    return resp.choices[0].message.content

result = tracked_completion(
    [{"role": "user", "content": "Summarize RAG in one sentence."}],
    max_tokens=80, temperature=0.3,
)
print(result)
A/B Prompt Experiment¶
# ⚠️ Last tested: 2026-04 | Requires: openai>=1.60, OPENAI_API_KEY
import statistics
import time
from openai import OpenAI

client = OpenAI()

def ab_eval(prompt_a: str, prompt_b: str, test_inputs: list[str]) -> None:
    """Compare two system prompts on latency and token usage over the same inputs."""
    for label, system in [("A", prompt_a), ("B", prompt_b)]:
        latencies, tokens = [], []
        for user_input in test_inputs:
            start = time.monotonic()
            resp = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "system", "content": system},
                          {"role": "user", "content": user_input}],
                max_tokens=150, temperature=0,
            )
            latencies.append((time.monotonic() - start) * 1000)
            tokens.append(resp.usage.total_tokens)
        print(f"Variant {label}: median_ms={statistics.median(latencies):.0f} "
              f"avg_tokens={statistics.mean(tokens):.1f}")

ab_eval(
    "You are a concise assistant. Answer in one sentence.",
    "You are a helpful assistant.",
    ["What is RAG?", "What is LoRA?", "What is MoE?"],
)
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | LLMs Overview, Evaluation And Benchmarks, Ethics Safety Alignment |
| Leads to | Enterprise AI deployment, Scalable AI systems |
| Compare with | Traditional MLOps (ML models), DevOps (software) |
| Cross-domain | Site Reliability Engineering, Platform engineering |
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Prompt-model version skew | Prompt templates break after model update | No versioning of prompt-model pairs | Prompt registry with model version pinning, integration tests |
| Shadow production drift | Staging doesn't predict production quality | Staging data differs from production | Production traffic shadowing, online eval with guardrails |
| Eval regression undetected | Shipped regression on long-tail queries | Eval suite too small | Growing eval set from production failures, stratified eval |
| Secret sprawl | API keys hardcoded in configs | No secrets management | Vault/secrets manager, environment variables, key rotation |
◆ Hands-On Exercises¶
Exercise 1: Build a Prompt Versioning System¶
Goal: Set up prompt versioning with A/B test capability
Time: 30 minutes
Steps:
1. Create a prompt registry (JSON/YAML config or database)
2. Implement version pinning (prompt v1 + model gpt-4o = deployment A)
3. Add A/B routing (50/50 split between prompt v1 and v2)
4. Log quality scores per variant
Expected Output: A/B test results showing prompt version comparison
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 📘 Book | "AI Engineering" by Chip Huyen (2025), Ch 8-9 | Definitive treatment of LLMOps patterns |
| 📘 Book | "Designing Machine Learning Systems" by Chip Huyen (2022) | MLOps foundations that LLMOps builds on |
| 🔧 Hands-on | LangSmith Documentation | Production LLM observability and evaluation platform |
| 🎥 Video | Chip Huyen — "Building LLM Applications for Production" | Practical LLMOps talk covering common pitfalls |
★ Sources¶
- LangSmith documentation — https://docs.smith.langchain.com
- Langfuse documentation — https://langfuse.com/docs
- Portkey AI Gateway — https://portkey.ai/docs
- Arize Phoenix — https://docs.arize.com/phoenix
- Hamel Husain, "Your AI Product Needs Evals" (2024)