Retrieval-Augmented Generation (RAG)¶
✨ Bit: RAG is like giving the LLM an open-book exam instead of asking it to recall everything from memory. Turns out, even AI does better with notes.
★ TL;DR¶
- What: A pattern that retrieves relevant external documents and feeds them to an LLM as context before generation
- Why: Reduces hallucination, enables up-to-date answers, works with private data — WITHOUT retraining the model
- Key point: The dominant technique for enterprise GenAI. If you're building GenAI products, you ARE building RAG pipelines.
★ Overview¶
Definition¶
RAG (Retrieval-Augmented Generation) is an architecture that combines information retrieval with text generation. Instead of relying solely on the LLM's parametric knowledge (what it memorized during training), RAG retrieves relevant documents from an external knowledge base at query time and includes them in the prompt.
Scope¶
This document covers RAG architecture, pipeline components, and advanced patterns. For embedding models and vector databases, see Vector Databases. For combining RAG with fine-tuning, see Fine Tuning.
Significance¶
- Most deployed GenAI pattern in production (2024-2026)
- Mitigates: Hallucination, knowledge cutoff, private data access
- Doesn't require: Retraining or fine-tuning the LLM
- Industry standard for: Enterprise Q&A, customer support, document intelligence
Prerequisites¶
- Llms Overview — what LLMs are and how they work
- Basic understanding of embeddings (vector representations of text)
★ Deep Dive¶
The Basic RAG Pipeline¶
┌──────────────────────────────────────────────────────────────┐
│ RAG PIPELINE │
│ │
│ INDEXING (one-time / periodic) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌─────────┐ │
│ │ Documents│ → │ Chunking │ → │ Embedding│ → │ Vector │ │
│ │ (raw) │ │ (split) │ │ (encode) │ │ DB │ │
│ └──────────┘ └──────────┘ └──────────┘ └─────────┘ │
│ │
│ RETRIEVAL + GENERATION (per query) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌─────────┐ │
│ │ User │ → │ Embed │ → │ Search │ → │ Top-K │ │
│ │ Query │ │ Query │ │ Vector DB│ │ Chunks │ │
│ └──────────┘ └──────────┘ └──────────┘ └────┬────┘ │
│ │ │
│ ┌──────────────────────────────────────────────────┐│ │
│ │ PROMPT = System Instructions + Retrieved Chunks + Query ││
│ └──────────────────────────────────────┬───────────┘│ │
│ ↓ │ │
│ ┌─────────┐ │
│ │ LLM │ │
│ │ Generate│ │
│ └────┬────┘ │
│ ↓ │
│ ┌─────────┐ │
│ │ Answer │ │
│ │ + Cited │ │
│ │ Sources │ │
│ └─────────┘ │
└──────────────────────────────────────────────────────────────┘
Pipeline Components Deep Dive¶
1. Document Loading & Processing¶
# Common formats: PDF, DOCX, HTML, Markdown, CSV, code files
# Key challenge: Preserving structure (tables, headers, lists)
# Tools: LangChain loaders, Unstructured.io, LlamaIndex readers
2. Chunking (CRITICAL — where most pipelines fail)¶
Strategy | Chunk Size | Overlap | When to Use
────────────────────────────────────────────────────────
Fixed size | 500-1000 | 100-200 | Quick start, general docs
Recursive splitting | 500-1000 | 100-200 | Text with natural hierarchy
Semantic | Variable | N/A | When meaning boundaries matter
Document-based | Full doc | N/A | Short docs (emails, tickets)
Sentence-level | 1-3 sents | N/A | Q&A, precise retrieval
⚠️ GOTCHA: Bad chunking = bad retrieval = bad answers.
If your RAG sucks, fix chunking FIRST.
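The two most common knobs (chunk size, overlap) reduce to a few lines. A minimal pure-Python sketch of fixed-size chunking with character overlap (a toy, not a library API — production code would use something like RecursiveCharacterTextSplitter):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks; consecutive chunks share `overlap` chars."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "A" * 2500
chunks = chunk_text(doc, chunk_size=1000, overlap=200)
print([len(c) for c in chunks])  # → [1000, 1000, 900]
```

Each chunk starts 800 chars after the previous one, so the 200-char tail of one chunk reappears at the head of the next — that overlap is what keeps sentences near a boundary retrievable from either side.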
3. Embedding Models¶
Convert text chunks and queries into high-dimensional vectors for similarity search.
| Model | Dimensions | Strengths |
|---|---|---|
| OpenAI text-embedding-3-large | 3072 | Best quality, API, Matryoshka dims |
| Gemini text-embedding-004 | Flexible | Multimodal! (text+image+video+audio) |
| Cohere embed-v4 | 1024 | Best multilingual (100+ languages) |
| Voyage AI voyage-3-large | — | Best for code & technical docs |
| bge-m3 (BAAI) | 1024 | Best open-source, hybrid retrieval |
| nomic-embed-v2 | 768 | Best for local/edge deployment |
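Whichever model you pick, retrieval ultimately compares vectors with cosine similarity. A dependency-free sketch with made-up 3-dim vectors (real embeddings have hundreds to thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [0.1, 0.9, 0.2]
chunk_on_topic = [0.15, 0.85, 0.25]   # similar direction → high score
chunk_off_topic = [0.9, 0.05, 0.1]    # different direction → low score
print(round(cosine_similarity(query, chunk_on_topic), 3))   # → 0.996
print(round(cosine_similarity(query, chunk_off_topic), 3))  # → 0.184
```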
4. Vector Database¶
| Database | Type | Key Feature |
|---|---|---|
| Pinecone | Managed | Serverless, easiest to start |
| Weaviate | Self-host/Managed | Hybrid search (vector + keyword) |
| Qdrant | Self-host/Managed | Best Rust performance |
| Chroma | Embedded | Simplest for prototyping |
| pgvector | Postgres extension | Use existing Postgres |
| FAISS | Library (Meta) | Fast local search, no server |
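Under the hood, every one of these answers the same question: given a query vector, which stored vectors are closest? A brute-force sketch of that query path (real DBs swap in ANN indexes like HNSW for scale; the doc IDs and vectors below are made up for illustration):

```python
import math

def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """index: list of (doc_id, vector). Return the k most similar doc IDs by cosine."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
    # score every stored vector, sort best-first, keep the top k
    scored = sorted(((cos(query_vec, v), doc_id) for doc_id, v in index), reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

index = [
    ("refund_policy", [0.9, 0.1]),
    ("shipping", [0.1, 0.9]),
    ("returns", [0.8, 0.3]),
]
print(top_k([0.95, 0.05], index, k=2))  # → ['refund_policy', 'returns']
```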
5. Retrieval Strategies¶
| Strategy | How | When |
|---|---|---|
| Semantic search | Cosine similarity on embeddings | Default |
| Keyword (BM25) | Traditional text matching | Technical terms, names |
| Hybrid | Combine semantic + keyword | Best overall performance |
| Re-ranking | Retrieve broadly, then re-rank with cross-encoder | Quality-critical apps |
| HyDE | Generate hypothetical answer, search with that | Vague queries |
| Multi-query | Generate multiple query variants, merge results | Complex questions |
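To make the BM25 row concrete, here is a minimal self-contained Okapi BM25 scorer with the standard k1/b defaults (a sketch — real systems use `rank_bm25` or the vector DB's built-in hybrid mode):

```python
import math
from collections import Counter

def bm25_scores(query: list[str], docs: list[list[str]], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized doc against a tokenized query with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n               # average doc length
    df = Counter(term for d in docs for term in set(d))  # document frequency per term
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)  # length normalization
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "error code E404 page not found".split(),
    "semantic similarity of page contents".split(),
]
scores = bm25_scores("E404".split(), docs)
print(scores[0] > scores[1])  # exact-ID match wins → True
```

This is exactly the case where pure semantic search struggles: an opaque identifier like "E404" has little embedding signal, but BM25 rewards the exact term match.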
Advanced RAG Patterns (2025-2026)¶
Basic RAG
└→ Advanced RAG
├── Hybrid Search (semantic + BM25)
├── Re-ranking (Cohere Rerank, cross-encoders)
├── Query Transformation (HyDE, multi-query, step-back)
├── Self-RAG (model decides when to retrieve)
├── Late Chunking (embed full doc, pool after) ← 2025-2026 standard
├── Contextual Retrieval (LLM-enrich each chunk) ← Anthropic 2024→standard
├── Corrective RAG / CRAG (grade + re-retrieve/search) ← 2026 agentic pattern
└→ Agentic RAG
├── Tool-calling RAG (agent decides what to search)
├── Multi-source RAG (different DBs, APIs, web)
└── Multi-step RAG (iterative retrieval-reasoning loops)
Late Chunking (2025-2026 Standard for Long-Doc Retrieval)¶
Problem with traditional chunking: Splitting first loses cross-chunk context. A chunk about "the CEO's decision" has no embedding signal about which company or which decision if those appear in earlier paragraphs.
Late Chunking reverses the order:
Traditional: Document → Chunk → Embed each chunk → Store
Late: Document → Embed FULL document (token-level) → Pool tokens per logical chunk → Store
Result: Each chunk's embedding retains context from the surrounding document.
# pip install transformers>=4.45 torch>=2.3
# ⚠️ Last tested: 2026-04 | Requires: jina-embeddings-v3 or nomic-embed-text-v2
# Illustrative sketch of the late chunking concept (needs a fast tokenizer
# for offset mapping; chunks beyond max_length are truncated away)
from transformers import AutoTokenizer, AutoModel
import torch

def late_chunking_embed(
    document: str,
    chunk_boundaries: list[tuple[int, int]],  # (char_start, char_end) per chunk
    model_name: str = "jinaai/jina-embeddings-v3",
) -> list[torch.Tensor]:
    """Embed at token level, then pool into chunk-level embeddings."""
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

    # 1. Tokenize the FULL document (not individual chunks), keeping each
    #    token's character span so chunk boundaries can be mapped back
    inputs = tokenizer(document, return_tensors="pt", truncation=True,
                       max_length=8192, return_offsets_mapping=True)
    offsets = inputs.pop("offset_mapping")[0]  # shape: [n_tokens, 2] char spans

    with torch.no_grad():
        # 2. Get token-level embeddings (contextually aware of full document)
        outputs = model(**inputs)
    token_embeddings = outputs.last_hidden_state[0]  # shape: [n_tokens, hidden_dim]

    # 3. Mean-pool token embeddings per logical chunk boundary
    chunk_embeddings = []
    for char_start, char_end in chunk_boundaries:
        # tokens whose character span overlaps this chunk's char range
        mask = (offsets[:, 1] > char_start) & (offsets[:, 0] < char_end)
        chunk_embeddings.append(token_embeddings[mask].mean(dim=0))
    return chunk_embeddings

# Models with long-context token embeddings suited to late chunking:
# jinaai/jina-embeddings-v3, nomic-ai/nomic-embed-text-v2-moe
When to use: Long documents (reports, legal docs, books) where a chunk's meaning depends on earlier context. Reported to outperform standard chunking by 10-20% on multi-hop retrieval benchmarks.
Contextual Retrieval (Anthropic, 2024 → 2026 Production Standard)¶
Problem: BM25 and semantic search fail when chunks use pronouns or references ("the approach," "this method") without repeating the noun.
Solution: Use an LLM to prepend a context string to each chunk before embedding.
# pip install anthropic>=0.34
# ⚠️ Last tested: 2026-04 | Requires: anthropic>=0.34, ANTHROPIC_API_KEY
# Cost note: use prompt caching — cache the full document, vary only per chunk
import anthropic

client = anthropic.Anthropic()

CONTEXT_PROMPT = """<document>
{full_document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk_content}
</chunk>
Please give a short succinct context (1-2 sentences) to situate this chunk within the overall document
for the purpose of improving search retrieval. Answer only with the succinct context."""

def add_chunk_context(full_document: str, chunk: str) -> str:
    """Prepend LLM-generated context to a chunk before embedding."""
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",  # cheapest Haiku for speed
        max_tokens=150,
        messages=[{"role": "user", "content": CONTEXT_PROMPT.format(
            full_document=full_document,
            chunk_content=chunk,
        )}],
    )
    context = response.content[0].text.strip()
    return f"{context}\n\n{chunk}"  # prepend context, embed the combined string

# Cost optimization: use prompt caching on the document prefix
# Anthropic reports 49% fewer retrieval failures with contextual + hybrid
# (BM25 + semantic), and 67% fewer with reranking added
enriched_chunk = add_chunk_context(full_document="...", chunk="This approach...")
Corrective RAG / CRAG — Grade → Decide → Act¶
When retrieved chunks are irrelevant, CRAG falls back to web search or query reformulation:
Query
↓
[Retrieve] → chunks
↓
[Grade each chunk: RELEVANT / IRRELEVANT / AMBIGUOUS]
↓
If ALL irrelevant → [Web Search] or [Reformulate Query] → [Re-retrieve]
If SOME relevant → [Filter to relevant chunks only]
↓
[Generate answer from filtered/searched context]
# ⚠️ Last tested: 2026-04 | Requires: openai>=1.60
# PSEUDOCODE — the grading + fallback pattern
from openai import OpenAI
import json

client = OpenAI()

def grade_chunk_relevance(question: str, chunk: str) -> str:
    """Grade retrieved chunk as relevant or not. Returns 'yes' or 'no'."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Is the following document relevant to answering the question?\n\n"
                f"Question: {question}\n\nDocument: {chunk[:500]}\n\n"
                f"Answer with JSON: {{\"relevant\": true/false}}"
            ),
        }],
        temperature=0,
        response_format={"type": "json_object"},
    )
    result = json.loads(resp.choices[0].message.content)
    return "yes" if result.get("relevant") else "no"

def corrective_rag(question: str, retrieved_chunks: list[str], web_search_fn=None) -> str:
    """CRAG: filter irrelevant chunks, fall back to web search if needed."""
    relevant = [c for c in retrieved_chunks if grade_chunk_relevance(question, c) == "yes"]
    if not relevant:
        # Fall back to web search (Tavily, Serper, Brave Search APIs)
        if web_search_fn:
            relevant = web_search_fn(question)
        else:
            return "I don't have reliable information to answer this."
    context = "\n\n".join(relevant)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer based on context only. Be concise."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
Reciprocal Rank Fusion (RRF) — Combining Multiple Retrieval Lists¶
RRF merges ranked lists from multiple retrieval methods (semantic + BM25 + metadata) without needing score normalization:
RRF(document d) = Σ_i 1 / (k + rank_i(d))
where: k = 60 (constant, empirically validated), rank_i = rank in list i
# ⚠️ Last tested: 2026-04 | Requires: Python 3.10+ (stdlib only)
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """
    Combine multiple ranked retrieval lists using RRF.
    Input: list of lists, each sorted by relevance (best first).
    Output: combined list sorted by fused score (best first).
    """
    scores: dict[str, float] = defaultdict(float)
    for ranked_list in ranked_lists:
        for rank, doc_id in enumerate(ranked_list, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Example: combine semantic search + BM25 results
semantic_results = ["doc_3", "doc_1", "doc_7", "doc_2", "doc_5"]  # ordered by cosine sim
bm25_results = ["doc_1", "doc_3", "doc_8", "doc_5", "doc_4"]      # ordered by BM25 score
fused = reciprocal_rank_fusion([semantic_results, bm25_results])
top_5 = [doc_id for doc_id, _ in fused[:5]]
print(f"Fused top-5: {top_5}")
# → doc_1 and doc_3 both appear in both lists at high ranks → RRF promotes them
RAG vs Fine-tuning vs Long Context¶
| Aspect | RAG | Fine-tuning | Long Context |
|---|---|---|---|
| When | Need up-to-date/private data | Need changed behavior/style | All info fits in context |
| Cost | Low (retrieval infra) | Medium (training compute) | High (per-token cost) |
| Latency | +retrieval time | Same as base model | Increases with context |
| Knowledge | Dynamic, updatable | Static (baked in) | Dynamic (in prompt) |
| Best for | Enterprise docs, knowledge bases | Domain-specific models | Small document sets |
2025-2026 consensus: Hybrid RAG + LoRA fine-tuning is the gold standard. RAG for facts, fine-tuning for behavior.
◆ Code & Implementation¶
Minimal RAG with LangChain (Python)¶
# pip install langchain>=0.3 langchain-openai>=0.3 langchain-community>=0.3 chromadb>=0.5
# ⚠️ Last tested: 2026-04 | Requires: langchain>=0.3
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

# 1. Load & Chunk
loader = PyPDFLoader("your_document.pdf")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " "],
)
chunks = splitter.split_documents(docs)

# 2. Embed & Store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings)

# 3. Build Retrieval Chain (modern LangChain pattern)
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5},
)
llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer based only on the provided context. If unsure, say so.\n\nContext:\n{context}"),
    ("human", "{input}"),
])
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

result = rag_chain.invoke({"input": "What are the key findings?"})
print(result["answer"])
# result["context"] contains the retrieved source documents
◆ Strengths vs Limitations¶
| ✅ Strengths | ❌ Limitations |
|---|---|
| No model retraining needed | Retrieval quality bottleneck |
| Always up-to-date (update docs, not model) | Chunking is hard to get right |
| Cites sources (traceable) | Adds latency (retrieval step) |
| Works with private/proprietary data | Context window limits how much retrieved context fits |
| Cheaper than fine-tuning | Can't change model behavior, only provide info |
◆ Quick Reference¶
RAG Pipeline:
Documents → Chunk → Embed → Store in Vector DB
Query → Embed → Search → Top-K Chunks → LLM → Answer
Key Params to Tune:
- Chunk size: 500-1000 chars (start here)
- Chunk overlap: ~20% of chunk size
- Top-K: 3-10 chunks (more = more context, more noise)
- Embedding model: text-embedding-3-small (budget) or large (quality)
- Search type: Hybrid (semantic + BM25) when possible
Quick Debug:
Bad answers? → Check retrieved chunks first
Irrelevant chunks? → Fix chunking or embedding model
Good chunks, bad answer? → Fix prompt or try better LLM
RAG Evaluation Metrics (RAG Triad)¶
THE RAG TRIAD — 3 metrics that cover everything:
1. CONTEXT RELEVANCE (retrieval quality)
"Are the retrieved chunks actually relevant to the question?"
Metric: What % of retrieved context is useful
Low score → Fix: chunking strategy, embedding model, or retrieval method
2. FAITHFULNESS / GROUNDEDNESS (hallucination check)
"Is the answer supported by the retrieved context?"
Metric: What % of claims in the answer can be traced to context
Low score → Fix: LLM prompt (cite sources), temperature, or model choice
3. ANSWER RELEVANCE (response quality)
"Does the answer actually address the question?"
Metric: How well does the answer match the user's intent
Low score → Fix: prompt template, LLM model, or retrieval strategy
┌─────────────┐ ┌────────────┐ ┌─────────────┐
│ Question │──1──▶│ Context │──2──▶│ Answer │
│ │ │ (retrieved)│ │ (generated) │
└──────┬──────┘ └────────────┘ └──────┬──────┘
│ │
└──────────────3──────────────────────────┘
1 = Context Relevance 2 = Faithfulness 3 = Answer Relevance
EVALUATION TOOLS:
RAGAS — most popular, uses LLM-as-judge, supports all 3 metrics
DeepEval — unit-test style, CI/CD friendly
TruLens — real-time monitoring, tracing
LangSmith — LangChain's eval + tracing platform
Arize Phoenix — open-source LLM observability
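As a toy illustration of the Faithfulness metric, here is a crude lexical-overlap heuristic: the fraction of answer sentences whose words mostly appear in the retrieved context. The tools above use LLM-as-judge instead; this is only a smoke test, and the 0.5 threshold is an arbitrary choice:

```python
import re

def naive_faithfulness(answer: str, context: str, threshold: float = 0.5) -> float:
    """Fraction of answer sentences with >= threshold word overlap vs context."""
    context_words = set(re.findall(r"\w+", context.lower()))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    supported = 0
    for s in sentences:
        words = re.findall(r"\w+", s.lower())
        overlap = sum(w in context_words for w in words) / max(len(words), 1)
        if overlap >= threshold:
            supported += 1
    return supported / max(len(sentences), 1)

context = "The refund window is 30 days from delivery."
grounded = "The refund window is 30 days."
hallucinated = "Refunds require a manager signature and notarized form."
print(naive_faithfulness(grounded, context))      # → 1.0
print(naive_faithfulness(hallucinated, context))  # → 0.0
```

Lexical overlap misses paraphrase (a faithful synonym-heavy answer scores low), which is exactly why the production tools grade with an LLM.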
○ Gotchas & Common Mistakes¶
- ⚠️ "Garbage in, garbage out": If your chunking splits a table across chunks, the answer will be wrong. Always inspect chunks.
- ⚠️ Embedding model mismatch: embedding model for indexing MUST match the one used for querying
- ⚠️ Over-retrieving: More chunks ≠ better. Too many chunks adds noise and can confuse the LLM.
- ⚠️ Ignoring hybrid search: Pure semantic search misses exact keywords, names, IDs. Always consider BM25 + semantic.
- ⚠️ Not evaluating: Most teams deploy RAG without measuring retrieval quality. Use RAGAS, DeepEval, or at minimum test manually.
○ Interview Angles¶
- Q: How would you improve a RAG pipeline that's giving wrong answers?
- A: Debug in order: (1) check whether the correct chunks are retrieved (retrieval eval); (2) if not, fix the chunking strategy or embedding model; (3) if the chunks are good but the answer is wrong, fix the prompt or use a better LLM. Also consider adding re-ranking.
- Q: When would you choose RAG over fine-tuning?
- A: RAG when you need up-to-date info, frequently changing knowledge, or source attribution. Fine-tuning when you need a different output style/format, domain-specific reasoning, or model behavior changes. Best: combine both (hybrid RAG + fine-tuning).
- Q: Explain the difference between semantic and keyword search in RAG.
- A: Semantic (vector) search finds conceptually similar content even with different words ("car" matches "automobile"). Keyword (BM25) search finds exact term matches. Hybrid combines both and performs best overall, because semantic misses exact terms and BM25 misses synonyms.
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | Llms Overview, Embeddings, Vector Search |
| Leads to | Ai Agents (Agentic RAG), Vector Databases |
| Compare with | Fine Tuning (changes model), Long-context (no retrieval), Knowledge graphs |
| Cross-domain | Information retrieval (IR), Search engines |
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Context poisoning | LLM generates confidently wrong answers with citations | Irrelevant or contradictory chunks retrieved | Reranking layer (cross-encoder), relevance score thresholds |
| Stale embeddings | Correct docs exist but aren't retrieved | New docs added without re-embedding, index not refreshed | Incremental indexing pipeline, TTL on embeddings |
| Chunk boundary loss | Answers miss key information that spans two chunks | Important context split across chunk boundaries | Overlapping chunks, parent-document retrieval, Late Chunking |
| Retrieval drift | Quality degrades over weeks without code changes | User query distribution shifts away from test queries | Continuous retrieval eval (MRR/nDCG), query log monitoring |
| Context window overflow | Token limit errors or truncated context | Too many chunks retrieved, no length management | Dynamic k selection, token-budget-aware retrieval |
| Chunk context loss | Answers miss cross-chunk references ("this approach") | Chunks embedded without surrounding document context | Contextual Retrieval (LLM-enrich) or Late Chunking |
| Retrieval grade blindness | CRAG/Adaptive RAG degrades without grader feedback | No mechanism to detect and correct poor retrieval | Add relevance grader node; fall back to web search on low scores |
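The "token-budget-aware retrieval" mitigation for context window overflow reduces to a greedy loop: keep the highest-ranked chunks until the estimated budget runs out. A sketch using a rough 4-chars-per-token estimate (an assumption; use tiktoken for real counts):

```python
def fit_to_budget(ranked_chunks: list[str], max_tokens: int = 3000) -> list[str]:
    """Greedily keep best-first chunks until the estimated token budget is spent."""
    kept, used = [], 0
    for chunk in ranked_chunks:  # assumed sorted best-first
        est = len(chunk) // 4 + 1  # rough token estimate (~4 chars/token)
        if used + est > max_tokens:
            break  # or `continue` to try squeezing in smaller later chunks
        kept.append(chunk)
        used += est
    return kept

chunks = ["a" * 4000, "b" * 4000, "c" * 4000, "d" * 4000]  # ~1000 tokens each
print(len(fit_to_budget(chunks, max_tokens=2500)))  # → 2
```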
◆ Hands-On Exercises¶
Exercise 1: Build and Break a RAG Pipeline¶
Goal: Build a minimal RAG pipeline, then systematically break it with adversarial queries
Time: 45 minutes
Steps:
1. Load a 10-page PDF with PyPDFLoader
2. Chunk with RecursiveCharacterTextSplitter (1000 chars, 200 overlap)
3. Embed with text-embedding-3-small, store in Chroma
4. Query with 5 normal questions — log retrieval scores
5. Query with 5 adversarial queries (ambiguous, multi-hop, out-of-scope) — document failures
Expected Output: Table comparing retrieval precision for normal vs adversarial queries
Exercise 2: Add a Reranking Layer¶
Goal: Add a cross-encoder reranker and measure retrieval quality improvement
Time: 30 minutes
Steps:
1. Take the pipeline from Exercise 1
2. Add sentence-transformers cross-encoder reranking on top-20 results
3. Re-run the same 10 queries
4. Compare MRR@5 before and after reranking
Expected Output: MRR improvement of 15-30% on adversarial queries
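Exercise 2 asks for MRR@5; if you prefer computing it by hand over pulling in an eval library, it is a few lines (the doc IDs below are made up for illustration):

```python
def mrr_at_k(results: list[list[str]], relevant: list[set[str]], k: int = 5) -> float:
    """Mean Reciprocal Rank: average over queries of 1/rank of the first
    relevant doc within the top-k results (0 if none appears)."""
    total = 0.0
    for ranked, rel in zip(results, relevant):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(results)

results = [["d2", "d7", "d1"], ["d9", "d3", "d4"]]  # per-query rankings
relevant = [{"d1"}, {"d4", "d5"}]                   # per-query gold sets
print(mrr_at_k(results, relevant, k=5))  # (1/3 + 1/3) / 2 → 0.333...
```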
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 📄 Paper | Lewis et al. "Retrieval-Augmented Generation" (2020) | The original RAG paper |
| 📘 Book | "AI Engineering" by Chip Huyen (2025), Ch 3-4 | Best practical treatment of RAG architecture and evaluation |
| 🎓 Course | deeplearning.ai — "Building and Evaluating RAG" | Hands-on RAG implementation course |
| 🔧 Hands-on | LlamaIndex RAG Tutorial | Production RAG framework with excellent documentation |
★ Sources¶
- Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (2020)
- LangChain documentation — https://docs.langchain.com
- LlamaIndex documentation — https://docs.llamaindex.ai
- RAGAS evaluation framework — https://docs.ragas.io