GenAI infrastructure encompasses everything between "I have a model" and "I have a production application" — orchestration frameworks, vector databases, model serving engines, evaluation tools, and observability platforms.
"I need a quick RAG prototype" → LlamaIndex
"I need complex agent workflows" → LangGraph
"I need maximum flexibility/control" → LangChain
"I'm in the Microsoft ecosystem" → Semantic Kernel
"I want minimal abstraction" → Direct API calls + custom code
```bash
# Run LLaMA locally with Ollama (simplest start)
ollama run llama3.2

# Serve with vLLM (production)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-8B \
  --tensor-parallel-size 2
```
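The vLLM server exposes an OpenAI-compatible endpoint, so any OpenAI client can talk to it. A minimal sketch, assuming the default bind of localhost:8000:

```python
# Query the vLLM server started above via its OpenAI-compatible API.
# Assumes the default host/port; no real key is needed unless the server
# was started with --api-key.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-8B",  # must match the served --model name
    messages=[{"role": "user", "content": "Summarize RAG in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```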
Q: How would you architect a production RAG system?
A: LLM via API (with fallback), vector DB (Qdrant/Pinecone) with hybrid search, LangChain/LlamaIndex for orchestration, LangSmith for tracing, RAGAS for eval. Add caching layer for repeated queries, rate limiting, and graceful degradation when LLM is unavailable.
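Two of those production concerns, caching repeated queries and degrading gracefully when a model is down, fit in a few lines. A hedged sketch: the model pair, TTL, and in-process dict cache are simplifying assumptions (production would use Redis or similar, and fall back across providers rather than between two OpenAI models):

```python
import hashlib
import time

from openai import OpenAI

client = OpenAI()
_cache: dict[str, tuple[float, str]] = {}
CACHE_TTL_SECONDS = 300  # assumed TTL; tune per workload

def call_with_fallback(prompt: str) -> str:
    # Try the primary model, then a fallback; a real system would usually
    # fall back to a second provider, not just a second model.
    for model in ("gpt-4o-mini", "gpt-4o"):
        try:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                timeout=10,  # fail fast so the fallback gets a chance
            )
            return resp.choices[0].message.content
        except Exception:
            continue
    return "Service temporarily unavailable; please retry."  # graceful degradation

def cached_llm_call(prompt: str) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    hit = _cache.get(key)
    if hit and time.time() - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]  # repeated query served without an LLM round trip
    answer = call_with_fallback(prompt)
    _cache[key] = (time.time(), answer)
    return answer
```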
Q: When would you self-host vs use an API?
A: Self-host when: high volume (cost), privacy requirements (data governance), latency needs (no network hop), or need fine-tuned open models. Use API when: low volume, need best quality (GPT-5/Claude 4), fast iteration, no GPU infra.
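The "high volume → self-host" point reduces to simple arithmetic. A back-of-envelope sketch in which every figure is a placeholder assumption to be replaced with current vendor pricing and your measured throughput:

```python
# Break-even sketch: per-token API pricing vs an always-on self-hosted GPU.
# ALL numbers below are assumptions for illustration only.
API_COST_PER_1M_TOKENS = 0.60        # $ per 1M tokens, blended in/out (assumed)
GPU_COST_PER_HOUR = 2.50             # $ per rented GPU-hour (assumed)
TOKENS_PER_GPU_HOUR = 10_000_000     # batched small-model serving (assumed)
HOURS_PER_MONTH = 730

def api_cost(tokens_per_month: float) -> float:
    return tokens_per_month / 1e6 * API_COST_PER_1M_TOKENS

def self_host_cost(tokens_per_month: float) -> float:
    # You pay for the GPU whether or not it is busy, hence the monthly floor.
    gpu_hours = max(tokens_per_month / TOKENS_PER_GPU_HOUR, HOURS_PER_MONTH)
    return gpu_hours * GPU_COST_PER_HOUR

for tokens in (1e8, 1e9, 1e10):
    print(f"{tokens:.0e} tok/mo  API ${api_cost(tokens):>8,.0f}"
          f"  self-host ${self_host_cost(tokens):>8,.0f}")
```

With these placeholder numbers the crossover lands somewhere between 1B and 10B tokens per month; the shape of the curve matters more than the exact figures.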
```python
# pip install "openai>=1.60" "langchain>=0.2" "langchain-openai>=0.1"
# ⚠️ Last tested: 2026-04 | Requires: openai>=1.60, langchain>=0.2, OPENAI_API_KEY

# ═══ DIRECT OPENAI API (recommended for simple cases) ═══
from openai import OpenAI

client = OpenAI()

def direct_rag(query: str, docs: list[str]) -> str:
    context = "\n\n".join(docs)
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer from context:\n{context}"},
            {"role": "user", "content": query},
        ],
        max_tokens=200,
    ).choices[0].message.content

# ═══ LANGCHAIN (for complex pipelines, RAG chains, agents) ═══
from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage

lc_model = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def langchain_call(query: str) -> str:
    messages = [
        SystemMessage(content="You are a concise assistant."),
        HumanMessage(content=query),
    ]
    return lc_model.invoke(messages).content

# Compare outputs
docs = ["RAG combines retrieval with LLM generation to ground answers in real context."]
print("Direct:", direct_rag("What is RAG?", docs))
print("LangChain:", langchain_call("What is RAG in one sentence?"))

# Key insight: direct API = less abstraction, fewer deps, easier debugging
# LangChain = worth it when you need: memory, chains, agents, callbacks
```
Goal: Assess and document a complete GenAI toolchain for a use case
Time: 30 minutes
Steps:
1. Pick a use case (e.g., internal knowledge assistant)
2. Select tools for each layer: LLM, embedding, vector DB, orchestration, serving
3. Document tradeoffs for each selection (cost, lock-in, maturity)
4. Create a stack diagram with version pins (see the example pin file after this exercise)
Expected Output: One-page stack decision document with rationale
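A hedged example of the version-pin piece of that deliverable; the packages and versions below are illustrative stand-ins, not recommendations:

```text
# stack-pins.txt (illustrative pins; swap in your chosen tools)
openai==1.60.0          # LLM + embedding layer (API)
qdrant-client==1.9.0    # vector DB layer
langchain==0.2.0        # orchestration layer
langchain-openai==0.1.0
ragas==0.1.7            # evaluation layer
vllm==0.4.0             # serving layer (if self-hosting)
```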