Retrieval-Augmented Generation (RAG)¶
✨ Bit: RAG is like giving the LLM an open-book exam instead of asking it to recall everything from memory. Turns out, even AI does better with notes.
★ TL;DR¶
- What: A pattern that retrieves relevant external documents and feeds them to an LLM as context before generation
- Why: Reduces hallucination, enables up-to-date answers, works with private data — WITHOUT retraining the model
- Key point: The dominant technique for enterprise GenAI. If you're building GenAI products, you ARE building RAG pipelines.
★ Overview¶
Definition¶
RAG (Retrieval-Augmented Generation) is an architecture that combines information retrieval with text generation. Instead of relying solely on the LLM's parametric knowledge (what it memorized during training), RAG retrieves relevant documents from an external knowledge base at query time and includes them in the prompt.
Scope¶
This document covers RAG architecture, pipeline components, and advanced patterns. For embedding models and vector databases, see Vector Databases. For combining RAG with fine-tuning, see Fine Tuning.
Significance¶
- Most deployed GenAI pattern in production (2024-2026)
- Mitigates: Hallucination, knowledge cutoff, private data access
- Doesn't require: Retraining or fine-tuning the LLM
- Industry standard for: Enterprise Q&A, customer support, document intelligence
Prerequisites¶
- Llms Overview — what LLMs are and how they work
- Basic understanding of embeddings (vector representations of text)
★ Deep Dive¶
The Basic RAG Pipeline¶
┌──────────────────────────────────────────────────────────────┐
│ RAG PIPELINE │
│ │
│ INDEXING (one-time / periodic) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌─────────┐ │
│ │ Documents│ → │ Chunking │ → │ Embedding│ → │ Vector │ │
│ │ (raw) │ │ (split) │ │ (encode) │ │ DB │ │
│ └──────────┘ └──────────┘ └──────────┘ └─────────┘ │
│ │
│ RETRIEVAL + GENERATION (per query) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌─────────┐ │
│ │ User │ → │ Embed │ → │ Search │ → │ Top-K │ │
│ │ Query │ │ Query │ │ Vector DB│ │ Chunks │ │
│ └──────────┘ └──────────┘ └──────────┘ └────┬────┘ │
│ │ │
│ ┌──────────────────────────────────────────────────┐│ │
│ │ PROMPT = System Instructions + Retrieved Chunks + Query ││
│ └──────────────────────────────────────┬───────────┘│ │
│ ↓ │ │
│ ┌─────────┐ │
│ │ LLM │ │
│ │ Generate│ │
│ └────┬────┘ │
│ ↓ │
│ ┌─────────┐ │
│ │ Answer │ │
│ │ + Cited │ │
│ │ Sources │ │
│ └─────────┘ │
└──────────────────────────────────────────────────────────────┘
Pipeline Components Deep Dive¶
1. Document Loading & Processing¶
# Common formats: PDF, DOCX, HTML, Markdown, CSV, code files
# Key challenge: Preserving structure (tables, headers, lists)
# Tools: LangChain loaders, Unstructured.io, LlamaIndex readers
2. Chunking (CRITICAL — where most pipelines fail)¶
Strategy | Chunk Size | Overlap | When to Use
────────────────────────────────────────────────────────
Fixed size | 500-1000 | 100-200 | Quick start, general docs
Recursive splitting | 500-1000 | 100-200 | Text with natural hierarchy
Semantic | Variable | N/A | When meaning boundaries matter
Document-based | Full doc | N/A | Short docs (emails, tickets)
Sentence-level | 1-3 sents | N/A | Q&A, precise retrieval
⚠️ GOTCHA: Bad chunking = bad retrieval = bad answers.
If your RAG sucks, fix chunking FIRST.
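The two most common knobs (chunk size, overlap) reduce to a few lines. A minimal pure-Python sketch of fixed-size chunking with character overlap (a toy, not a library API — production code would use something like RecursiveCharacterTextSplitter):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks; consecutive chunks share `overlap` chars."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "A" * 2500
chunks = chunk_text(doc, chunk_size=1000, overlap=200)
print([len(c) for c in chunks])  # → [1000, 1000, 900]
```

Each chunk starts 800 chars after the previous one, so the 200-char tail of one chunk reappears at the head of the next — that overlap is what keeps sentences near a boundary retrievable from either side.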
3. Embedding Models¶
Convert text chunks and queries into high-dimensional vectors for similarity search.
| Model | Dimensions | Strengths |
|---|---|---|
| OpenAI text-embedding-3-large | 3072 | Best quality, API, Matryoshka dims |
| Gemini text-embedding-004 | Flexible | Multimodal! (text+image+video+audio) |
| Cohere embed-v4 | 1024 | Best multilingual (100+ languages) |
| Voyage AI voyage-3-large | — | Best for code & technical docs |
| bge-m3 (BAAI) | 1024 | Best open-source, hybrid retrieval |
| nomic-embed-v2 | 768 | Best for local/edge deployment |
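Whichever model you pick, retrieval ultimately compares vectors with cosine similarity. A dependency-free sketch with made-up 3-dim vectors (real embeddings have hundreds to thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [0.1, 0.9, 0.2]
chunk_on_topic = [0.15, 0.85, 0.25]   # similar direction → high score
chunk_off_topic = [0.9, 0.05, 0.1]    # different direction → low score
print(round(cosine_similarity(query, chunk_on_topic), 3))   # → 0.996
print(round(cosine_similarity(query, chunk_off_topic), 3))  # → 0.184
```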
4. Vector Database¶
| Database | Type | Key Feature |
|---|---|---|
| Pinecone | Managed | Serverless, easiest to start |
| Weaviate | Self-host/Managed | Hybrid search (vector + keyword) |
| Qdrant | Self-host/Managed | Best Rust performance |
| Chroma | Embedded | Simplest for prototyping |
| pgvector | Postgres extension | Use existing Postgres |
| FAISS | Library (Meta) | Fast local search, no server |
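Under the hood, every one of these answers the same question: given a query vector, which stored vectors are closest? A brute-force sketch of that query path (real DBs swap in ANN indexes like HNSW for scale; the doc IDs and vectors below are made up for illustration):

```python
import math

def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """index: list of (doc_id, vector). Return the k most similar doc IDs by cosine."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
    # score every stored vector, sort best-first, keep the top k
    scored = sorted(((cos(query_vec, v), doc_id) for doc_id, v in index), reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

index = [
    ("refund_policy", [0.9, 0.1]),
    ("shipping", [0.1, 0.9]),
    ("returns", [0.8, 0.3]),
]
print(top_k([0.95, 0.05], index, k=2))  # → ['refund_policy', 'returns']
```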
5. Retrieval Strategies¶
| Strategy | How | When |
|---|---|---|
| Semantic search | Cosine similarity on embeddings | Default |
| Keyword (BM25) | Traditional text matching | Technical terms, names |
| Hybrid | Combine semantic + keyword | Best overall performance |
| Re-ranking | Retrieve broadly, then re-rank with cross-encoder | Quality-critical apps |
| HyDE | Generate hypothetical answer, search with that | Vague queries |
| Multi-query | Generate multiple query variants, merge results | Complex questions |
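To make the BM25 row concrete, here is a minimal self-contained Okapi BM25 scorer with the standard k1/b defaults (a sketch — real systems use `rank_bm25` or the vector DB's built-in hybrid mode):

```python
import math
from collections import Counter

def bm25_scores(query: list[str], docs: list[list[str]], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized doc against a tokenized query with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n               # average doc length
    df = Counter(term for d in docs for term in set(d))  # document frequency per term
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)  # length normalization
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "error code E404 page not found".split(),
    "semantic similarity of page contents".split(),
]
scores = bm25_scores("E404".split(), docs)
print(scores[0] > scores[1])  # exact-ID match wins → True
```

This is exactly the case where pure semantic search struggles: an opaque identifier like "E404" has little embedding signal, but BM25 rewards the exact term match.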
Advanced RAG Patterns (2025-2026)¶
Basic RAG
└→ Advanced RAG
├── Hybrid Search (semantic + BM25)
├── Re-ranking (Cohere Rerank, cross-encoders)
├── Query Transformation (HyDE, multi-query, step-back)
├── Self-RAG (model decides when to retrieve)
├── Late Chunking (embed full doc, pool after) ← 2025-2026 standard
├── Contextual Retrieval (LLM-enrich each chunk) ← Anthropic 2024→standard
├── Corrective RAG / CRAG (grade + re-retrieve/search) ← 2026 agentic pattern
└→ Agentic RAG
├── Tool-calling RAG (agent decides what to search)
├── Multi-source RAG (different DBs, APIs, web)
└── Multi-step RAG (iterative retrieval-reasoning loops)
Late Chunking (2025-2026 Standard for Long-Doc Retrieval)¶
Problem with traditional chunking: Splitting first loses cross-chunk context. A chunk about "the CEO's decision" has no embedding signal about which company or which decision if those appear in earlier paragraphs.
Late Chunking reverses the order:
Traditional: Document → Chunk → Embed each chunk → Store
Late: Document → Embed FULL document (token-level) → Pool tokens per logical chunk → Store
Result: Each chunk's embedding retains context from the surrounding document.
# pip install transformers>=4.45 torch>=2.3
# ⚠️ Last tested: 2026-04 | Requires: jina-embeddings-v3 or nomic-embed-text-v2
# Illustrative sketch of the late chunking concept (needs a fast tokenizer
# for offset mapping; chunks beyond max_length are truncated away)
from transformers import AutoTokenizer, AutoModel
import torch

def late_chunking_embed(
    document: str,
    chunk_boundaries: list[tuple[int, int]],  # (char_start, char_end) per chunk
    model_name: str = "jinaai/jina-embeddings-v3",
) -> list[torch.Tensor]:
    """Embed at token level, then pool into chunk-level embeddings."""
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

    # 1. Tokenize the FULL document (not individual chunks), keeping each
    #    token's character span so chunk boundaries can be mapped back
    inputs = tokenizer(document, return_tensors="pt", truncation=True,
                       max_length=8192, return_offsets_mapping=True)
    offsets = inputs.pop("offset_mapping")[0]  # shape: [n_tokens, 2] char spans

    with torch.no_grad():
        # 2. Get token-level embeddings (contextually aware of full document)
        outputs = model(**inputs)
    token_embeddings = outputs.last_hidden_state[0]  # shape: [n_tokens, hidden_dim]

    # 3. Mean-pool token embeddings per logical chunk boundary
    chunk_embeddings = []
    for char_start, char_end in chunk_boundaries:
        # tokens whose character span overlaps this chunk's char range
        mask = (offsets[:, 1] > char_start) & (offsets[:, 0] < char_end)
        chunk_embeddings.append(token_embeddings[mask].mean(dim=0))
    return chunk_embeddings

# Models with long-context token embeddings suited to late chunking:
# jinaai/jina-embeddings-v3, nomic-ai/nomic-embed-text-v2-moe
When to use: Long documents (reports, legal docs, books) where a chunk's meaning depends on earlier context. Reported to outperform standard chunking by 10-20% on multi-hop retrieval benchmarks.
Contextual Retrieval (Anthropic, 2024 → 2026 Production Standard)¶
Problem: BM25 and semantic search fail when chunks use pronouns or references ("the approach," "this method") without repeating the noun.
Solution: Use an LLM to prepend a context string to each chunk before embedding.
# pip install anthropic>=0.34
# ⚠️ Last tested: 2026-04 | Requires: anthropic>=0.34, ANTHROPIC_API_KEY
# Cost note: use prompt caching — cache the full document, vary only per chunk
import anthropic

client = anthropic.Anthropic()

CONTEXT_PROMPT = """<document>
{full_document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk_content}
</chunk>
Please give a short succinct context (1-2 sentences) to situate this chunk within the overall document
for the purpose of improving search retrieval. Answer only with the succinct context."""

def add_chunk_context(full_document: str, chunk: str) -> str:
    """Prepend LLM-generated context to a chunk before embedding."""
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",  # cheapest Haiku for speed
        max_tokens=150,
        messages=[{"role": "user", "content": CONTEXT_PROMPT.format(
            full_document=full_document,
            chunk_content=chunk,
        )}],
    )
    context = response.content[0].text.strip()
    return f"{context}\n\n{chunk}"  # prepend context, embed the combined string

# Cost optimization: use prompt caching on the document prefix
# Anthropic reports 49% fewer retrieval failures with contextual + hybrid
# (BM25 + semantic), and 67% fewer with reranking added
enriched_chunk = add_chunk_context(full_document="...", chunk="This approach...")
Corrective RAG / CRAG — Grade → Decide → Act¶
When retrieved chunks are irrelevant, CRAG falls back to web search or query reformulation:
Query
↓
[Retrieve] → chunks
↓
[Grade each chunk: RELEVANT / IRRELEVANT / AMBIGUOUS]
↓
If ALL irrelevant → [Web Search] or [Reformulate Query] → [Re-retrieve]
If SOME relevant → [Filter to relevant chunks only]
↓
[Generate answer from filtered/searched context]
# ⚠️ Last tested: 2026-04 | Requires: openai>=1.60
# PSEUDOCODE — the grading + fallback pattern
from openai import OpenAI
import json

client = OpenAI()

def grade_chunk_relevance(question: str, chunk: str) -> str:
    """Grade retrieved chunk as relevant or not. Returns 'yes' or 'no'."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Is the following document relevant to answering the question?\n\n"
                f"Question: {question}\n\nDocument: {chunk[:500]}\n\n"
                f"Answer with JSON: {{\"relevant\": true/false}}"
            ),
        }],
        temperature=0,
        response_format={"type": "json_object"},
    )
    result = json.loads(resp.choices[0].message.content)
    return "yes" if result.get("relevant") else "no"

def corrective_rag(question: str, retrieved_chunks: list[str], web_search_fn=None) -> str:
    """CRAG: filter irrelevant chunks, fall back to web search if needed."""
    relevant = [c for c in retrieved_chunks if grade_chunk_relevance(question, c) == "yes"]
    if not relevant:
        # Fall back to web search (Tavily, Serper, Brave Search APIs)
        if web_search_fn:
            relevant = web_search_fn(question)
        else:
            return "I don't have reliable information to answer this."
    context = "\n\n".join(relevant)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer based on context only. Be concise."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
Reciprocal Rank Fusion (RRF) — Combining Multiple Retrieval Lists¶
RRF merges ranked lists from multiple retrieval methods (semantic + BM25 + metadata) without needing score normalization:
RRF(document d) = Σ_i 1 / (k + rank_i(d))
where: k = 60 (constant, empirically validated), rank_i = rank in list i
# ⚠️ Last tested: 2026-04 | Requires: Python 3.10+ (stdlib only)
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """
    Combine multiple ranked retrieval lists using RRF.
    Input: list of lists, each sorted by relevance (best first).
    Output: combined list sorted by fused score (best first).
    """
    scores: dict[str, float] = defaultdict(float)
    for ranked_list in ranked_lists:
        for rank, doc_id in enumerate(ranked_list, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Example: combine semantic search + BM25 results
semantic_results = ["doc_3", "doc_1", "doc_7", "doc_2", "doc_5"]  # ordered by cosine sim
bm25_results = ["doc_1", "doc_3", "doc_8", "doc_5", "doc_4"]      # ordered by BM25 score
fused = reciprocal_rank_fusion([semantic_results, bm25_results])
top_5 = [doc_id for doc_id, _ in fused[:5]]
print(f"Fused top-5: {top_5}")
# → doc_1 and doc_3 both appear in both lists at high ranks → RRF promotes them
RAG vs Fine-tuning vs Long Context¶
| Aspect | RAG | Fine-tuning | Long Context |
|---|---|---|---|
| When | Need up-to-date/private data | Need changed behavior/style | All info fits in context |
| Cost | Low (retrieval infra) | Medium (training compute) | High (per-token cost) |
| Latency | +retrieval time | Same as base model | Increases with context |
| Knowledge | Dynamic, updatable | Static (baked in) | Dynamic (in prompt) |
| Best for | Enterprise docs, knowledge bases | Domain-specific models | Small document sets |
2025-2026 consensus: Hybrid RAG + LoRA fine-tuning is the gold standard. RAG for facts, fine-tuning for behavior.
◆ Code & Implementation¶
Minimal RAG with LangChain (Python)¶
# pip install langchain>=0.3 langchain-openai>=0.3 langchain-community>=0.3 chromadb>=0.5
# ⚠️ Last tested: 2026-04 | Requires: langchain>=0.3
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

# 1. Load & Chunk
loader = PyPDFLoader("your_document.pdf")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " "],
)
chunks = splitter.split_documents(docs)

# 2. Embed & Store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings)

# 3. Build Retrieval Chain (modern LangChain pattern)
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5},
)
llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer based only on the provided context. If unsure, say so.\n\nContext:\n{context}"),
    ("human", "{input}"),
])
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

result = rag_chain.invoke({"input": "What are the key findings?"})
print(result["answer"])
# result["context"] contains the retrieved source documents
◆ Strengths vs Limitations¶
| ✅ Strengths | ❌ Limitations |
|---|---|
| No model retraining needed | Retrieval quality bottleneck |
| Always up-to-date (update docs, not model) | Chunking is hard to get right |
| Cites sources (traceable) | Adds latency (retrieval step) |
| Works with private/proprietary data | Context window limits how much retrieved context fits |
| Cheaper than fine-tuning | Can't change model behavior, only provide info |
◆ Quick Reference¶
RAG Pipeline:
Documents → Chunk → Embed → Store in Vector DB
Query → Embed → Search → Top-K Chunks → LLM → Answer
Key Params to Tune:
- Chunk size: 500-1000 chars (start here)
- Chunk overlap: ~20% of chunk size
- Top-K: 3-10 chunks (more = more context, more noise)
- Embedding model: text-embedding-3-small (budget) or large (quality)
- Search type: Hybrid (semantic + BM25) when possible
Quick Debug:
Bad answers? → Check retrieved chunks first
Irrelevant chunks? → Fix chunking or embedding model
Good chunks, bad answer? → Fix prompt or try better LLM
RAG Evaluation Metrics (RAG Triad)¶
THE RAG TRIAD — 3 metrics that cover everything:
1. CONTEXT RELEVANCE (retrieval quality)
"Are the retrieved chunks actually relevant to the question?"
Metric: What % of retrieved context is useful
Low score → Fix: chunking strategy, embedding model, or retrieval method
2. FAITHFULNESS / GROUNDEDNESS (hallucination check)
"Is the answer supported by the retrieved context?"
Metric: What % of claims in the answer can be traced to context
Low score → Fix: LLM prompt (cite sources), temperature, or model choice
3. ANSWER RELEVANCE (response quality)
"Does the answer actually address the question?"
Metric: How well does the answer match the user's intent
Low score → Fix: prompt template, LLM model, or retrieval strategy
┌─────────────┐ ┌────────────┐ ┌─────────────┐
│ Question │──1──▶│ Context │──2──▶│ Answer │
│ │ │ (retrieved)│ │ (generated) │
└──────┬──────┘ └────────────┘ └──────┬──────┘
│ │
└──────────────3──────────────────────────┘
1 = Context Relevance 2 = Faithfulness 3 = Answer Relevance
EVALUATION TOOLS:
RAGAS — most popular, uses LLM-as-judge, supports all 3 metrics
DeepEval — unit-test style, CI/CD friendly
TruLens — real-time monitoring, tracing
LangSmith — LangChain's eval + tracing platform
Arize Phoenix — open-source LLM observability
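As a toy illustration of the Faithfulness metric, here is a crude lexical-overlap heuristic: the fraction of answer sentences whose words mostly appear in the retrieved context. The tools above use LLM-as-judge instead; this is only a smoke test, and the 0.5 threshold is an arbitrary choice:

```python
import re

def naive_faithfulness(answer: str, context: str, threshold: float = 0.5) -> float:
    """Fraction of answer sentences with >= threshold word overlap vs context."""
    context_words = set(re.findall(r"\w+", context.lower()))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    supported = 0
    for s in sentences:
        words = re.findall(r"\w+", s.lower())
        overlap = sum(w in context_words for w in words) / max(len(words), 1)
        if overlap >= threshold:
            supported += 1
    return supported / max(len(sentences), 1)

context = "The refund window is 30 days from delivery."
grounded = "The refund window is 30 days."
hallucinated = "Refunds require a manager signature and notarized form."
print(naive_faithfulness(grounded, context))      # → 1.0
print(naive_faithfulness(hallucinated, context))  # → 0.0
```

Lexical overlap misses paraphrase (a faithful synonym-heavy answer scores low), which is exactly why the production tools grade with an LLM.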
○ Gotchas & Common Mistakes¶
- ⚠️ "Garbage in, garbage out": If your chunking splits a table across chunks, the answer will be wrong. Always inspect chunks.
- ⚠️ Embedding model mismatch: embedding model for indexing MUST match the one used for querying
- ⚠️ Over-retrieving: More chunks ≠ better. Too many chunks adds noise and can confuse the LLM.
- ⚠️ Ignoring hybrid search: Pure semantic search misses exact keywords, names, IDs. Always consider BM25 + semantic.
- ⚠️ Not evaluating: Most teams deploy RAG without measuring retrieval quality. Use RAGAS, DeepEval, or at minimum test manually.
○ Interview Angles¶
- Q: How would you improve a RAG pipeline that's giving wrong answers?
- A: Debug in order: (1) check whether the correct chunks are retrieved (retrieval eval); (2) if not, fix the chunking strategy or embedding model; (3) if the chunks are good but the answer is wrong, fix the prompt or use a better LLM. Also consider adding re-ranking.
- Q: When would you choose RAG over fine-tuning?
- A: RAG when you need up-to-date info, frequently changing knowledge, or source attribution. Fine-tuning when you need a different output style/format, domain-specific reasoning, or model behavior changes. Best: combine both (hybrid RAG + fine-tuning).
- Q: Explain the difference between semantic and keyword search in RAG.
- A: Semantic (vector) search finds conceptually similar content even with different words ("car" matches "automobile"). Keyword (BM25) search finds exact term matches. Hybrid combines both and performs best overall, because semantic misses exact terms and BM25 misses synonyms.
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | Llms Overview, Embeddings, Vector Search |
| Leads to | Ai Agents (Agentic RAG), Vector Databases |
| Compare with | Fine Tuning (changes model), Long-context (no retrieval), Knowledge graphs |
| Cross-domain | Information retrieval (IR), Search engines |
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Context poisoning | LLM generates confidently wrong answers with citations | Irrelevant or contradictory chunks retrieved | Reranking layer (cross-encoder), relevance score thresholds |
| Stale embeddings | Correct docs exist but aren't retrieved | New docs added without re-embedding, index not refreshed | Incremental indexing pipeline, TTL on embeddings |
| Chunk boundary loss | Answers miss key information that spans two chunks | Important context split across chunk boundaries | Overlapping chunks, parent-document retrieval, Late Chunking |
| Retrieval drift | Quality degrades over weeks without code changes | User query distribution shifts away from test queries | Continuous retrieval eval (MRR/nDCG), query log monitoring |
| Context window overflow | Token limit errors or truncated context | Too many chunks retrieved, no length management | Dynamic k selection, token-budget-aware retrieval |
| Chunk context loss | Answers miss cross-chunk references ("this approach") | Chunks embedded without surrounding document context | Contextual Retrieval (LLM-enrich) or Late Chunking |
| Retrieval grade blindness | CRAG/Adaptive RAG degrades without grader feedback | No mechanism to detect and correct poor retrieval | Add relevance grader node; fall back to web search on low scores |
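The "token-budget-aware retrieval" mitigation for context window overflow reduces to a greedy loop: keep the highest-ranked chunks until the estimated budget runs out. A sketch using a rough 4-chars-per-token estimate (an assumption; use tiktoken for real counts):

```python
def fit_to_budget(ranked_chunks: list[str], max_tokens: int = 3000) -> list[str]:
    """Greedily keep best-first chunks until the estimated token budget is spent."""
    kept, used = [], 0
    for chunk in ranked_chunks:  # assumed sorted best-first
        est = len(chunk) // 4 + 1  # rough token estimate (~4 chars/token)
        if used + est > max_tokens:
            break  # or `continue` to try squeezing in smaller later chunks
        kept.append(chunk)
        used += est
    return kept

chunks = ["a" * 4000, "b" * 4000, "c" * 4000, "d" * 4000]  # ~1000 tokens each
print(len(fit_to_budget(chunks, max_tokens=2500)))  # → 2
```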
◆ Hands-On Exercises¶
Exercise 1: Build and Break a RAG Pipeline¶
Goal: Build a minimal RAG pipeline, then systematically break it with adversarial queries
Time: 45 minutes
Steps:
1. Load a 10-page PDF with PyPDFLoader
2. Chunk with RecursiveCharacterTextSplitter (1000 chars, 200 overlap)
3. Embed with text-embedding-3-small, store in Chroma
4. Query with 5 normal questions — log retrieval scores
5. Query with 5 adversarial queries (ambiguous, multi-hop, out-of-scope) — document failures
Expected Output: Table comparing retrieval precision for normal vs adversarial queries
Exercise 2: Add a Reranking Layer¶
Goal: Add a cross-encoder reranker and measure retrieval quality improvement
Time: 30 minutes
Steps:
1. Take the pipeline from Exercise 1
2. Add sentence-transformers cross-encoder reranking on top-20 results
3. Re-run the same 10 queries
4. Compare MRR@5 before and after reranking
Expected Output: MRR improvement of 15-30% on adversarial queries
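Exercise 2 asks for MRR@5; if you prefer computing it by hand over pulling in an eval library, it is a few lines (the doc IDs below are made up for illustration):

```python
def mrr_at_k(results: list[list[str]], relevant: list[set[str]], k: int = 5) -> float:
    """Mean Reciprocal Rank: average over queries of 1/rank of the first
    relevant doc within the top-k results (0 if none appears)."""
    total = 0.0
    for ranked, rel in zip(results, relevant):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(results)

results = [["d2", "d7", "d1"], ["d9", "d3", "d4"]]  # per-query rankings
relevant = [{"d1"}, {"d4", "d5"}]                   # per-query gold sets
print(mrr_at_k(results, relevant, k=5))  # (1/3 + 1/3) / 2 → 0.333...
```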
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 📄 Paper | Lewis et al. "Retrieval-Augmented Generation" (2020) | The original RAG paper |
| 📘 Book | "AI Engineering" by Chip Huyen (2025), Ch 3-4 | Best practical treatment of RAG architecture and evaluation |
| 🎓 Course | deeplearning.ai — "Building and Evaluating RAG" | Hands-on RAG implementation course |
| 🔧 Hands-on | LlamaIndex RAG Tutorial | Production RAG framework with excellent documentation |
★ Sources¶
- Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (2020)
- LangChain documentation — https://docs.langchain.com
- LlamaIndex documentation — https://docs.llamaindex.ai
- RAGAS evaluation framework — https://docs.ragas.io