
GenAI Tools & Infrastructure

Bit: The model is 10% of the work. The infrastructure around it is the other 90%. Welcome to production.


★ TL;DR

  • What: The ecosystem of frameworks, databases, serving engines, and platforms used to build GenAI applications
  • Why: Knowing models is theory. Knowing the tooling is what gets you hired and makes things work in production.
  • Key point: The stack is converging: Orchestration (LangChain/LlamaIndex) + Vector DB + Serving Engine + Observability

★ Overview

Definition

GenAI infrastructure encompasses everything between "I have a model" and "I have a production application" — orchestration frameworks, vector databases, model serving engines, evaluation tools, and observability platforms.

Scope

This is the overview/index document. Deep dives on individual tools are in sub-documents:

  • Vector Databases — Pinecone, Weaviate, Qdrant, Chroma, pgvector
  • Cloud ML Services & Managed AI Platforms
  • Distributed Systems Fundamentals for AI
  • ML Experiment & Data Management
  • For orchestration + RAG code, also see Rag

Significance

  • This is where deep tech separates from wrapper dev
  • Understanding infra = you can architect systems, not just call APIs
  • The layer where most production problems live (latency, cost, reliability)

Prerequisites

  • Llms Overview — what you're serving/orchestrating
  • Rag — primary use case for most tools

★ Deep Dive

The GenAI Application Stack

┌─────────────────────────────────────────────────────────────┐
│                    APPLICATION LAYER                         │
│  Chat UI │ API Endpoints │ Slack/Teams Bot │ Internal Tools  │
├─────────────────────────────────────────────────────────────┤
│                    ORCHESTRATION LAYER                       │
│  LangChain │ LlamaIndex │ Semantic Kernel │ Custom Code     │
├──────────────────────┬──────────────────────────────────────┤
│   RETRIEVAL          │         GENERATION                    │
│  ┌────────────────┐  │  ┌────────────────────────────────┐  │
│  │ Vector DB      │  │  │ Model API / Self-hosted LLM    │  │
│  │ (Pinecone,     │  │  │ (OpenAI, Anthropic, vLLM,      │  │
│  │  Weaviate,     │  │  │  Ollama, TGI)                  │  │
│  │  Qdrant)       │  │  │                                │  │
│  ├────────────────┤  │  ├────────────────────────────────┤  │
│  │ Embedding      │  │  │ Guardrails / Safety            │  │
│  │ Models         │  │  │ (NeMo, Guardrails AI)          │  │
│  └────────────────┘  │  └────────────────────────────────┘  │
├──────────────────────┴──────────────────────────────────────┤
│                    OBSERVABILITY & EVAL                      │
│  LangSmith │ Weights & Biases │ Phoenix │ RAGAS │ DeepEval  │
├─────────────────────────────────────────────────────────────┤
│                    COMPUTE / INFRA                           │
│  GPU Cloud (AWS, GCP, Azure) │ Serverless │ On-prem         │
└─────────────────────────────────────────────────────────────┘

Tool Categories & Top Picks

1. Orchestration Frameworks

Framework       | Language  | Strengths                              | Best For
LangChain       | Python/JS | Largest ecosystem, most integrations   | General purpose, RAG, chains
LlamaIndex      | Python    | Best for data/RAG, structured indexing | Data-heavy apps, enterprise search
LangGraph       | Python    | Stateful agent graphs (by LangChain)   | Complex agents, workflows
Semantic Kernel | C#/Python | Microsoft ecosystem, enterprise        | .NET shops, Azure-first
Haystack        | Python    | Clean API, production-focused          | Search/RAG pipelines

When to use what:

"I need a quick RAG prototype"       → LlamaIndex
"I need complex agent workflows"     → LangGraph
"I need maximum flexibility/control" → LangChain
"I'm in the Microsoft ecosystem"     → Semantic Kernel
"I want minimal abstraction"         → Direct API calls + custom code

2. Model Serving & Inference

Engine            | Use Case           | Key Feature
vLLM              | Self-host LLMs     | PagedAttention, fastest open-source inference
Ollama            | Local development  | Run LLMs locally with one command
TGI (HuggingFace) | Production serving | Docker-ready, HF integration
TensorRT-LLM      | NVIDIA GPUs        | Best NVIDIA optimization
llama.cpp         | CPU / Edge         | Run quantized models on CPU
SGLang            | High throughput    | RadixAttention, constrained decoding

# Run LLaMA locally with Ollama (simplest start)
ollama run llama3.2

# Serve with vLLM (production)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2
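
The vLLM server above exposes an OpenAI-compatible endpoint (port 8000 by default), so application code can stay the same whether you call a managed API or your own GPU box. A sketch, assuming the server from the previous command is running locally:

from openai import OpenAI

# Same client library, different base_url; the API key is ignored by a default vLLM setup
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",   # must match the --model you served
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=100,
)
print(resp.choices[0].message.content)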

3. Evaluation & Observability

Tool             | Purpose                   | Key Feature
LangSmith        | Tracing, eval, monitoring | LangChain native, best debugging
RAGAS            | RAG evaluation            | Automated faithfulness/relevance metrics
DeepEval         | LLM testing               | Unit tests for LLM outputs
Phoenix (Arize)  | Observability             | Open-source tracing
Weights & Biases | Experiment tracking       | ML experiment management
Braintrust       | Eval + logging            | Prompt playground + eval
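
As a taste of what "automated faithfulness/relevance metrics" means in practice, here is a minimal RAGAS run — a sketch against the ragas 0.1.x API, which has changed across releases (check current docs); the LLM-as-judge metrics below also need an OPENAI_API_KEY:

# pip install ragas datasets
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

eval_data = Dataset.from_dict({
    "question": ["What is RAG?"],
    "answer":   ["RAG retrieves relevant documents and passes them to an LLM as context."],
    "contexts": [["RAG combines retrieval with LLM generation to ground answers in real data."]],
})

# Each metric is scored by an LLM judge; results come back as 0-1 scores
scores = evaluate(eval_data, metrics=[faithfulness, answer_relevancy])
print(scores)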

4. Platforms (Managed)

Platform         | What It Provides
Hugging Face     | Model hub, Spaces, Inference API, datasets
Replicate        | One-click model deployment
Together AI      | Fast API for open models
Fireworks AI     | Low-latency serving for open models
AWS Bedrock      | Managed access to multiple models
Google Vertex AI | Gemini + model garden + fine-tuning
Azure AI Studio  | OpenAI models + enterprise features

◆ Types & Classifications

By Deployment Pattern

How to Serve LLMs
├── API (Managed)
│   ├── Direct: OpenAI, Anthropic, Google APIs
│   └── Aggregator: Together AI, Fireworks, Replicate
├── Self-Hosted (Your Infra)
│   ├── vLLM / TGI (GPU server)
│   ├── Ollama (local dev)
│   └── llama.cpp (CPU/edge)
└── Hybrid
    ├── Cloud GPU (RunPod, Lambda, AWS)
    └── On-prem + cloud burst

Cost Decision Matrix

Scenario           | Best Choice          | Why
Prototyping        | API (OpenAI/Claude)  | Fast start, no infra
< 100K tokens/day  | API                  | Cheaper than running a GPU
100K-1M tokens/day | Evaluate both        | Crossover point
> 1M tokens/day    | Self-host (vLLM)     | API spend can exceed a dedicated GPU
Privacy-critical   | Self-host            | Data stays on your infra
Latency-critical   | Self-host + optimize | Control over serving
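
The crossover rows are easier to reason about with a back-of-envelope calculation. A sketch with illustrative numbers only — the per-token prices, traffic mix, and GPU rental rate below are assumptions; plug in your provider's actual pricing:

# Rough monthly cost comparison: managed API vs. one rented GPU running vLLM
tokens_per_day = 1_500_000
input_share, output_share = 0.7, 0.3            # assumed traffic mix

# Assumed prices — substitute your provider's real numbers
api_in, api_out = 15.00, 75.00                  # $ per 1M tokens (premium-model ballpark)
gpu_hourly = 2.00                               # $ per hour, single high-end GPU

monthly_millions = tokens_per_day * 30 / 1e6
api_cost = monthly_millions * (input_share * api_in + output_share * api_out)
gpu_cost = gpu_hourly * 24 * 30                 # billed whether busy or idle

print(f"API:       ${api_cost:,.0f}/month")     # ~$1,485 at these assumptions
print(f"Self-host: ${gpu_cost:,.0f}/month")     # ~$1,440, plus the ops effort to run it
# Cheaper API models or lower volume push the crossover much higher — always redo this math.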

◆ Quick Reference

STARTER STACK (prototype):
  LLM:        OpenAI API / Claude API
  Framework:  LangChain or LlamaIndex
  Vector DB:  Chroma (local) or Pinecone (managed)
  Eval:       Manual + RAGAS

PRODUCTION STACK:
  LLM:        vLLM (self-host) or API with fallback
  Framework:  LangGraph / custom orchestration
  Vector DB:  Qdrant or Weaviate (self-host) or Pinecone
  Eval:       LangSmith + RAGAS + custom metrics
  Monitor:    LangSmith or Phoenix
  Guardrails: NeMo Guardrails or custom

BUDGET STACK (learning / hobby):
  LLM:        Ollama (local) + free tier APIs
  Framework:  LangChain
  Vector DB:  Chroma (embedded)
  Eval:       Manual testing
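
For the budget/starter stacks, the embedded Chroma piece is only a few lines — a sketch assuming the chromadb package with its default local embedding function (no external services needed):

# pip install chromadb
import chromadb

client = chromadb.Client()                 # in-memory; use chromadb.PersistentClient(path=...) to keep data
collection = client.create_collection("notes")

# Chroma embeds documents with a local default model unless you supply embeddings
collection.add(
    ids=["doc1", "doc2"],
    documents=[
        "vLLM uses PagedAttention for high-throughput LLM serving.",
        "Chroma is an embedded vector database aimed at prototypes.",
    ],
)

results = collection.query(query_texts=["what should I use to serve a model?"], n_results=1)
print(results["documents"][0][0])          # expect the vLLM sentence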

○ Gotchas & Common Mistakes

  • ⚠️ Framework lock-in: LangChain abstractions are convenient but can hide important details. Understand what's happening underneath.
  • ⚠️ "Just use the API" at scale: At 1M+ tokens/day, API costs can be $1000+/month. Do the math before committing.
  • ⚠️ Ignoring evaluation: Most teams ship GenAI without measuring quality. Build eval into your pipeline from day 1.
  • ⚠️ Ollama in production: Ollama is for dev, not production serving. Use vLLM or TGI for production workloads.
  • ⚠️ Vector DB hype: For < 100K documents, pgvector (Postgres extension) is probably enough. Don't over-architect.
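
To make the pgvector point concrete: similarity search in plain Postgres is one SQL query. A sketch assuming the pgvector extension is installed and a docs table with an embedding vector(1536) column already exists — the connection string, table name, and dimensions are placeholders:

# pip install "psycopg[binary]" pgvector numpy
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("postgresql://localhost/mydb")       # placeholder connection string
register_vector(conn)                                        # teaches psycopg about the vector type

query_embedding = np.random.rand(1536).astype(np.float32)    # stand-in for a real embedding

# "<->" is pgvector's L2-distance operator; "<=>" gives cosine distance
rows = conn.execute(
    "SELECT content FROM docs ORDER BY embedding <-> %s LIMIT 5",
    (query_embedding,),
).fetchall()
for (content,) in rows:
    print(content)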

○ Interview Angles

  • Q: How would you architect a production RAG system?
  • A: LLM via API (with fallback), vector DB (Qdrant/Pinecone) with hybrid search, LangChain/LlamaIndex for orchestration, LangSmith for tracing, RAGAS for eval. Add caching layer for repeated queries, rate limiting, and graceful degradation when LLM is unavailable.

  • Q: When would you self-host vs use an API?

  • A: Self-host when: high volume (cost), privacy requirements (data governance), latency needs (no network hop), or need fine-tuned open models. Use API when: low volume, need best quality (GPT-5/Claude 4), fast iteration, no GPU infra.
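
The "API with fallback" and "graceful degradation" points come up constantly in these questions; a minimal version is just ordered retries across models. A sketch — the model names and error handling are illustrative, not a hardened client:

from openai import OpenAI

client = OpenAI()

def generate_with_fallback(prompt: str, models: tuple[str, ...] = ("gpt-4o", "gpt-4o-mini")) -> str:
    """Try models in order; degrade gracefully if everything is down."""
    for model in models:
        try:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=200,
                timeout=10,                      # fail fast so the fallback gets a chance
            )
            return resp.choices[0].message.content
        except Exception as exc:                 # in production: catch specific API errors, log, back off
            print(f"{model} failed: {exc}")
    return "The assistant is temporarily unavailable. Please try again shortly."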

★ Code & Implementation

LangChain vs Direct API Comparison

# pip install "openai>=1.60" "langchain>=0.2" "langchain-openai>=0.1"
# ⚠️ Last tested: 2026-04 | Requires: openai>=1.60, langchain>=0.2, OPENAI_API_KEY

# ═══ DIRECT OPENAI API (recommended for simple cases) ═══
from openai import OpenAI
client = OpenAI()

def direct_rag(query: str, docs: list[str]) -> str:
    context = "\n\n".join(docs)
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer from context:\n{context}"},
            {"role": "user",   "content": query},
        ],
        max_tokens=200,
    ).choices[0].message.content

# ═══ LANGCHAIN (for complex pipelines, RAG chains, agents) ═══
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage

lc_model = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def langchain_call(query: str) -> str:
    messages = [
        SystemMessage(content="You are a concise assistant."),
        HumanMessage(content=query),
    ]
    return lc_model.invoke(messages).content

# Compare outputs
docs = ["RAG combines retrieval with LLM generation to ground answers in real context."]
print("Direct:", direct_rag("What is RAG?", docs))
print("LangChain:", langchain_call("What is RAG in one sentence?"))
# Key insight: direct API = less abstraction, fewer deps, easier debugging
# LangChain = worth it when you need: memory, chains, agents, callbacks

★ Connections

Relationship | Topics
Builds on    | Llms Overview, Rag
Leads to     | Production GenAI systems, MLOps
Compare with | Traditional ML infra (MLflow, Kubeflow)
Cross-domain | DevOps, Cloud architecture, Systems design

◆ Production Failure Modes

Failure                 | Symptoms                                                    | Root Cause                                    | Mitigation
Tool sprawl             | Team uses 5 different experiment trackers, no standard      | No standardized toolchain decision            | Document standard stack, enforce via CI/CD templates
Version incompatibility | LangChain update breaks production pipeline                 | No pinned dependencies, no integration tests  | Pin versions, test upgrades in staging, use lockfiles
Framework lock-in       | Cannot switch from LangChain to LlamaIndex without rewrite  | Tight coupling to framework internals         | Abstract LLM calls behind an interface, minimize framework surface
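
The lock-in mitigation in the last row — "abstract LLM calls behind an interface" — can be as small as a Protocol that the rest of the codebase depends on, so swapping LangChain for direct API calls (or anything else) means writing one new adapter. A sketch:

from typing import Protocol

class LLMClient(Protocol):
    """The only LLM surface the application is allowed to depend on."""
    def complete(self, system: str, user: str) -> str: ...

class OpenAIDirect:
    def __init__(self, model: str = "gpt-4o-mini"):
        from openai import OpenAI
        self._client, self._model = OpenAI(), model

    def complete(self, system: str, user: str) -> str:
        resp = self._client.chat.completions.create(
            model=self._model,
            messages=[{"role": "system", "content": system},
                      {"role": "user", "content": user}],
        )
        return resp.choices[0].message.content

# Application code only sees the Protocol — a LangChain- or vLLM-backed implementation
# added later is one new adapter, not a rewrite of every call site.
def summarize(llm: LLMClient, text: str) -> str:
    return llm.complete("You summarize text in one sentence.", text)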

◆ Hands-On Exercises

Exercise 1: Evaluate a GenAI Stack

Goal: Assess and document a complete GenAI toolchain for a use case
Time: 30 minutes
Steps:
  1. Pick a use case (e.g., internal knowledge assistant)
  2. Select tools for each layer: LLM, embedding, vector DB, orchestration, serving
  3. Document tradeoffs for each selection (cost, lock-in, maturity)
  4. Create a stack diagram with version pins
Expected Output: One-page stack decision document with rationale


◆ Resources

Type        | Resource                              | Why
📘 Book     | "AI Engineering" by Chip Huyen (2025) | Covers the full AI tooling landscape
🔧 Hands-on | HuggingFace Ecosystem                 | Central hub for models, datasets, and tools
🔧 Hands-on | LangChain Documentation               | Comprehensive LLM application framework

★ Sources

  • LangChain documentation — https://docs.langchain.com
  • LlamaIndex documentation — https://docs.llamaindex.ai
  • vLLM documentation — https://docs.vllm.ai
  • Ollama — https://ollama.com
  • Hugging Face Hub — https://huggingface.co