Docker & Kubernetes for GenAI Deployment¶
Containers make GenAI workloads reproducible. Kubernetes makes them operable when one container is no longer enough.
★ TL;DR¶
- What: The core deployment stack for packaging, shipping, and scaling AI services.
- Why: Most production AI systems fail on environment drift, weak rollout practices, or poor scaling long before they fail on model quality.
- Key point: Use Docker to standardize runtime; use Kubernetes when you need repeatable multi-instance operations, autoscaling, and infrastructure policy.
★ Overview¶
Definition¶
Docker packages an application and its dependencies into a portable container image. Kubernetes schedules and manages those containers across a cluster.
Scope¶
This note covers the practical concepts an AI engineer needs: image design, container boundaries, Kubernetes objects, GPU scheduling, and deployment patterns. For model-specific request handling, see Model Serving for LLM Applications.
Significance¶
- Self-hosted GenAI almost always becomes a container problem before it becomes a model problem.
- Hiring teams use Docker and Kubernetes as a proxy for "can this person ship systems, not just notebooks?"
- Good container discipline improves security, reproducibility, and rollback speed.
Prerequisites¶
- LLMOps & Production Deployment
- AI System Design for GenAI Applications
- Model Serving for LLM Applications
★ Deep Dive¶
Why Containers Matter for AI¶
AI apps are unusually dependency-heavy:
- CUDA and driver compatibility matter
- model weights can be large and sensitive to path/layout assumptions
- Python environments drift easily across laptops, CI, and servers
- serving stacks often combine API code, model runtime, queues, and observability agents
Containers give you a predictable runtime boundary.
Docker Building Blocks¶
| Concept | What It Does | Why It Matters |
|---|---|---|
| Image | Immutable packaged filesystem + config | What you build and deploy |
| Container | Running instance of an image | What actually serves traffic |
| Dockerfile | Build recipe | Encodes reproducibility |
| Registry | Stores images | Enables CI/CD and rollbacks |
| Volume | Persistent storage | Avoid baking mutable data into images |
| Network | Container connectivity | Important for gateways, vector DBs, and tracing |
Container Design Rules for GenAI¶
- Keep each image focused on a single responsibility, for example an API server or a worker.
- Do not bake secrets into images.
- Prefer pulling large model weights at startup or mounting them from managed storage (see the volume-mount sketch after the Dockerfile example below).
- Separate build-time dependencies from runtime dependencies with multi-stage builds.
- Make health checks explicit.
Example Dockerfile for a FastAPI LLM Gateway¶
FROM python:3.12-slim AS build
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

FROM python:3.12-slim
WORKDIR /app
# Copy only the installed packages, not the build tooling
COPY --from=build /install /usr/local
COPY . .
ENV PYTHONUNBUFFERED=1
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
Core Kubernetes Objects¶
| Object | Purpose | AI Example |
|---|---|---|
| Pod | Smallest deployable unit | One API server or inference worker |
| Deployment | Manages rolling updates for stateless workloads | LLM gateway replicas |
| Service | Stable internal network endpoint | Route traffic to pods |
| ConfigMap | Non-secret config | Feature flags, model route settings |
| Secret | Sensitive config | API keys, database credentials |
| Job/CronJob | Batch or scheduled work | Offline eval runs, embedding sync |
| HorizontalPodAutoscaler | Adjust replica count | Scale API gateways on traffic |
| Ingress/Gateway | External traffic entry | Public API or internal portal |
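To make the table concrete, here is a minimal sketch of a Service fronting the llm-gateway Deployment shown later in this note, plus a ConfigMap for non-secret settings. All names and values are illustrative:
# Sketch: stable endpoint + non-secret config for the gateway pods.
apiVersion: v1
kind: Service
metadata:
  name: llm-gateway
spec:
  selector:
    app: llm-gateway
  ports:
    - port: 80
      targetPort: 8000
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-gateway-config
data:
  DEFAULT_MODEL_ROUTE: "gpt-4o-mini"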
Deployment Patterns¶
| Pattern | When To Use | Notes |
|---|---|---|
| Single container on VM | Early prototype or low-traffic internal app | Lowest ops overhead |
| Docker Compose | Local integration testing | Good for app + vector DB + tracing stack |
| Kubernetes deployment | Multi-service or team-managed production | Standard platform choice |
| Kubernetes + GPU nodes | Self-hosted model serving | Requires GPU scheduling and cost controls |
GPU Scheduling in Kubernetes¶
For self-hosted inference, the platform typically includes:
- GPU node pools
- device plugins from the hardware vendor
- node selectors or taints/tolerations to isolate GPU workloads
- autoscaling policies to avoid idle expensive nodes
A common split is:
- CPU pods for orchestration, retrieval, and API layers
- GPU pods for inference servers
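Putting those pieces together, a GPU pod spec might look like the sketch below. It assumes the NVIDIA device plugin is installed (it exposes the nvidia.com/gpu resource) and that your GPU nodes carry the label and taint shown; those conventions vary by cluster.
# Sketch: pin an inference server to a tainted GPU node pool.
# Label and taint names are cluster-specific assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: inference-server
spec:
  nodeSelector:
    gpu-pool: "true"
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: server
      image: ghcr.io/example/inference-server:1.0.0
      resources:
        limits:
          nvidia.com/gpu: 1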
Minimal Kubernetes Deployment Shape¶
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-gateway
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-gateway
  template:
    metadata:
      labels:
        app: llm-gateway
    spec:
      containers:
        - name: api
          image: ghcr.io/example/llm-gateway:1.0.0
          ports:
            - containerPort: 8000
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
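To pair that Deployment with autoscaling, a minimal HorizontalPodAutoscaler might look like the sketch below. CPU-based scaling assumes the Deployment also declares CPU requests; scaling on request queue depth or tokens per second requires a custom or external metrics adapter and is not shown here.
# Sketch: scale the llm-gateway Deployment on CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-gateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-gateway
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70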
When Kubernetes Is Worth It¶
Use Kubernetes when you need several of these at once:
- multiple services with independent scaling
- reliable rollouts and rollbacks
- cluster-wide policy and observability
- scheduled jobs and background workers
- shared platform conventions across teams
Do not adopt Kubernetes only because it feels "more production."
◆ Quick Reference¶
| Situation | Better First Move |
|---|---|
| Shipping a prototype | Docker on one VM or managed platform |
| Reproducing dev and CI environments | Docker |
| Running several services together locally | Docker Compose |
| Rolling updates and autoscaling | Kubernetes Deployment + HPA |
| Offline evaluation jobs | Kubernetes Job or CronJob |
| Expensive GPU workloads | Separate GPU node pools and strict autoscaling |
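For the offline-evaluation row, a CronJob is often the simplest shape. A sketch, with an illustrative image and schedule:
# Sketch: nightly offline evaluation run as a CronJob.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: offline-eval
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: eval
              image: ghcr.io/example/eval-runner:1.0.0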
○ Gotchas & Common Mistakes¶
- Large model downloads can make pod startup painfully slow; plan warmup and image strategy.
- "One huge container" creates noisy failure domains and painful deploys.
- Kubernetes does not fix bad serving architecture; it only manages it.
- GPU cost can explode if autoscaling and idle shutdown are weak.
○ Interview Angles¶
- Q: When would you choose Kubernetes for a GenAI system?
- A: When the system has multiple independently scaled services, controlled rollouts, background jobs, observability requirements, or self-hosted inference that needs GPU scheduling. For smaller systems, a managed platform or a simple container deployment may be better.
- Q: What is the main Docker benefit for AI teams?
- A: Reproducibility. It removes "works on my machine" failures across notebooks, CI, and production while standardizing dependencies and rollout behavior.
★ Code & Implementation¶
Containerize a FastAPI LLM Service¶
# Dockerfile — production LLM API service
# ⚠️ Last tested: 2026-04 | Requires: Docker 24+
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
ENV PORT=8080
EXPOSE 8080
# Non-root user for security
RUN useradd -m appuser && chown -R appuser /app
USER appuser
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080", "--workers", "4"]
# main.py — FastAPI LLM endpoint
# pip install fastapi>=0.110 uvicorn>=0.29 openai>=1.60 pydantic>=2
# ⚠️ Last tested: 2026-04
import os

from fastapi import FastAPI, HTTPException
from openai import APIError, OpenAI
from pydantic import BaseModel

app = FastAPI(title="LLM API")
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])


class ChatRequest(BaseModel):
    message: str
    model: str = "gpt-4o-mini"
    max_tokens: int = 200


class ChatResponse(BaseModel):
    response: str
    model: str
    tokens_used: int


@app.post("/chat", response_model=ChatResponse)
async def chat(req: ChatRequest) -> ChatResponse:
    try:
        resp = client.chat.completions.create(
            model=req.model,
            messages=[{"role": "user", "content": req.message}],
            max_tokens=req.max_tokens,
        )
    except APIError as e:
        raise HTTPException(status_code=502, detail=str(e))
    return ChatResponse(
        response=resp.choices[0].message.content or "",  # content can be None
        model=resp.model,
        tokens_used=resp.usage.total_tokens if resp.usage else 0,
    )


@app.get("/health")
async def health() -> dict:
    return {"status": "ok"}
# docker-compose.yml for local development + testing
# Note: the top-level "version" key is obsolete in Compose v2 and omitted here.
services:
  llm-api:
    build: .
    ports: ["8080:8080"]
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    healthcheck:
      # python:3.12-slim ships no curl, so probe with the stdlib instead
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')"]
      interval: 30s
      timeout: 5s
      retries: 3
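In Kubernetes, the same OPENAI_API_KEY would come from a Secret rather than a Compose environment entry, keeping it out of the image per the design rules above. A sketch with illustrative names; never commit real values, and create the Secret out of band (for example with kubectl create secret generic):
# Sketch: Secret-backed API key for the Kubernetes deployment.
apiVersion: v1
kind: Secret
metadata:
  name: llm-api-secrets
type: Opaque
stringData:
  OPENAI_API_KEY: "replace-me"  # placeholder; set the real value out of band
---
# In the Deployment's container spec, reference the Secret:
# env:
#   - name: OPENAI_API_KEY
#     valueFrom:
#       secretKeyRef:
#         name: llm-api-secrets
#         key: OPENAI_API_KEY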
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | LLMOps & Production Deployment, AI System Design for GenAI Applications |
| Leads to | Model Serving for LLM Applications, Monitoring & Observability for GenAI Systems, CI/CD for ML and LLM Systems |
| Compare with | Managed PaaS deployment, serverless inference |
| Cross-domain | DevOps, platform engineering, SRE |
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| GPU scheduling failures | Pods stuck in Pending, no GPU assigned | Insufficient GPU node pool, no resource quotas | Node auto-scaling, quotas, GPU sharing (MIG, time-slicing) |
| Image size explosion | 15GB+ container images, slow pulls | CUDA runtime + model weights in image | Multi-stage builds, model weights via volume mount |
| OOM kills during inference | Container killed mid-request | Memory limit too low for model + KV-cache | Profile actual memory, set limits 20% above peak |
| Health check false positives | K8s restarts healthy pods | Health check doesn't verify GPU readiness | Custom health endpoint with test inference |
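For the slow-startup and health-check rows, a startupProbe alongside the readinessProbe is one common mitigation: it tolerates a long cold start without masking later failures. A container-spec fragment, with illustrative endpoint paths and timings:
# Sketch: probe config for a server with a slow model load.
# Readiness should hit an endpoint that returns 200 only once
# the model can actually serve (e.g. after a test inference).
startupProbe:
  httpGet:
    path: /health
    port: 8000
  failureThreshold: 30
  periodSeconds: 10   # allows up to ~5 minutes for weight download/load
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 15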
◆ Hands-On Exercises¶
Exercise 1: Containerize a Model Server¶
Goal: Build an optimized Docker image for LLM serving and deploy it to K8s
Time: 45 minutes
Steps:
1. Write a multi-stage Dockerfile (build deps, then runtime-only)
2. Mount model weights as a volume (not baked into the image)
3. Deploy to minikube with GPU resource requests
4. Test horizontal pod autoscaling based on request queue depth
Expected Output: Running pod with GPU access, image size under 2GB
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 🔧 Hands-on | Docker Official Documentation | Container fundamentals for ML deployment |
| 🔧 Hands-on | Kubernetes for ML (Kubeflow) | ML-specific Kubernetes orchestration |
| 📘 Book | "Kubernetes in Action" by Luksa (2020) | Comprehensive K8s reference |
| 🎥 Video | TechWorld with Nana — "Docker + K8s" | Best beginner-friendly container tutorials |
★ Sources¶
- Docker documentation - https://docs.docker.com
- Kubernetes documentation - https://kubernetes.io/docs
- NVIDIA Kubernetes device plugin - https://github.com/NVIDIA/k8s-device-plugin
- LLMOps & Production Deployment