Retrieval Evaluation

Bit: If your RAG system retrieves the wrong documents, no amount of prompt engineering will fix the answer. Retrieval evaluation measures whether the right information reaches the model — the most neglected and most impactful part of RAG quality.


★ TL;DR

  • What: Metrics and methods for measuring retrieval quality in RAG systems — precision, recall, MRR, nDCG, and end-to-end RAG evaluation
  • Why: Most RAG failures are retrieval failures. If irrelevant chunks reach the model, it hallucinates or refuses. Measuring retrieval quality separately from generation quality is essential.
  • Key point: Evaluate retrieval independently from generation. A perfect LLM can't help if retrieval gives it the wrong documents.

★ Overview

Definition

Retrieval evaluation measures how effectively a retrieval system finds and ranks relevant documents for a given query. In RAG, this means measuring whether the chunks that reach the LLM actually contain the information needed to answer the question.

Scope

Covers: Core retrieval metrics (precision@k, recall@k, MRR, nDCG), RAG-specific evaluation (context relevance, groundedness, faithfulness), evaluation frameworks (RAGAS, DeepEval), and practical implementation. For LLM generation evaluation, see LLM Evaluation.

Prerequisites


★ Deep Dive

The Two-Stage RAG Evaluation Framework

RAG EVALUATION = RETRIEVAL QUALITY + GENERATION QUALITY

  RETRIEVAL METRICS           GENERATION METRICS
  (independent of LLM)        (depends on retrieval)
  ─────────────────           ──────────────────
  Precision@K                 Answer correctness
  Recall@K                    Faithfulness (grounded in context)
  MRR                         Answer relevance
  nDCG                        Completeness
  Hit Rate                    Hallucination rate

  KEY INSIGHT: Always evaluate retrieval FIRST.
  If retrieval is bad, improving the LLM prompt won't help.

Core Retrieval Metrics

Metric      | What It Measures                               | Formula                            | Range
────────────|────────────────────────────────────────────────|────────────────────────────────────|──────
Precision@K | % of retrieved docs that are relevant          | relevant_in_top_k / k              | 0-1
Recall@K    | % of all relevant docs that were retrieved     | relevant_in_top_k / total_relevant | 0-1
Hit Rate    | Did at least one relevant doc appear in top K? | 1 if any relevant in top_k, else 0 | 0 or 1
MRR         | How high is the first relevant result?         | 1 / rank_of_first_relevant         | 0-1
nDCG@K      | Ranking quality (top positions weigh more)     | See formula below                  | 0-1
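
A quick worked example of the precision, recall, and MRR formulas (the document IDs are illustrative, not from any benchmark):

# Illustrative only: 3 relevant docs, 4 retrieved, first relevant result at rank 1
retrieved = ["doc_a", "doc_x", "doc_b", "doc_y"]   # ranked results
relevant  = {"doc_a", "doc_b", "doc_c"}            # ground-truth relevant set

hits = len(set(retrieved) & relevant)              # 2 relevant docs in the top 4
precision_at_4  = hits / 4                         # 2/4 = 0.50
recall_at_4     = hits / len(relevant)             # 2/3 ≈ 0.67
reciprocal_rank = 1 / 1                            # first relevant doc is at rank 1
print(precision_at_4, round(recall_at_4, 2), reciprocal_rank)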

nDCG: The Gold Standard Ranking Metric

nDCG@K = DCG@K / IDCG@K

  DCG@K  = Σ (relevance_i / log2(rank_i + 1))  for i = 1 to K
  IDCG@K = DCG@K with perfect ranking (most relevant first)

  EXAMPLE:
  Query: "What is attention in transformers?"
  Retrieved docs (relevance: 0=irrelevant, 1=somewhat, 2=highly):

  Rank 1: "Attention mechanism explained" → relevance 2
  Rank 2: "CNN architectures"             → relevance 0
  Rank 3: "Self-attention in BERT"        → relevance 2
  Rank 4: "Transformer training tips"     → relevance 1

  DCG@4  = 2/log2(2) + 0/log2(3) + 2/log2(4) + 1/log2(5)
         = 2.0 + 0.0 + 1.0 + 0.43 = 3.43

  Perfect ranking: [2, 2, 1, 0]
  IDCG@4 = 2/log2(2) + 2/log2(3) + 1/log2(4) + 0/log2(5)
         = 2.0 + 1.26 + 0.5 + 0.0 = 3.76

  nDCG@4 = 3.43 / 3.76 = 0.91  (good ranking!)
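
To sanity-check the arithmetic, the same example can be reproduced in a few lines of Python (the ndcg_at_k helper in the implementation section below performs the same computation):

import numpy as np

# Relevance of the four retrieved docs, in retrieved order (from the example above)
rels = [2, 0, 2, 1]
dcg  = sum(r / np.log2(i + 2) for i, r in enumerate(rels))                          # ≈ 3.43
idcg = sum(r / np.log2(i + 2) for i, r in enumerate(sorted(rels, reverse=True)))    # ≈ 3.76
print(round(dcg / idcg, 2))  # 0.91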

RAG-Specific Metrics (RAGAS Framework)

Metric            | What It Measures                             | How
──────────────────|──────────────────────────────────────────────|──────────────────────────────────────────────
Context Relevance | Are retrieved chunks relevant to the query?  | LLM judges if each chunk is relevant
Context Recall    | Do retrieved chunks cover all needed info?   | Compare retrieved info vs ground truth answer
Faithfulness      | Is the answer grounded in retrieved context? | Check each claim in answer against context
Answer Relevance  | Does the answer address the question?        | LLM judges answer-question alignment
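
The RAGAS library packages these metrics behind a single evaluate() call. The sketch below assumes the ragas 0.1.x API (imports, column names, and metric names have shifted between releases; context_precision here roughly corresponds to "Context Relevance" in the table) and assumes an LLM judge is configured, by default via an OpenAI API key. Treat it as a starting point, not a definitive recipe.

# pip install ragas datasets
# ⚠️ Sketch only: verify against the RAGAS docs for your installed version
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

# One hypothetical evaluation record; a real suite needs many labeled queries
eval_data = Dataset.from_dict({
    "question": ["What is attention in transformers?"],
    "contexts": [["Attention lets each token weight every other token when building its representation."]],
    "answer": ["Attention computes weighted combinations of token representations based on relevance."],
    "ground_truth": ["Attention is a mechanism that weights interactions between tokens."],
})

scores = evaluate(
    eval_data,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)
print(scores)  # one aggregate score per metric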

★ Code & Implementation

Retrieval Evaluation Suite

# pip install numpy>=1.24
# ⚠️ Last tested: 2026-04 | No external API needed

import numpy as np

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Precision@K: fraction of top-K results that are relevant."""
    top_k = retrieved[:k]
    return len(set(top_k) & relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Recall@K: fraction of all relevant docs found in top-K."""
    if not relevant:
        return 0.0
    top_k = retrieved[:k]
    return len(set(top_k) & relevant) / len(relevant)

def hit_rate(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Hit Rate: 1 if any relevant doc in top-K, else 0."""
    return 1.0 if set(retrieved[:k]) & relevant else 0.0

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Mean Reciprocal Rank: 1/rank of first relevant result."""
    for i, doc in enumerate(retrieved):
        if doc in relevant:
            return 1.0 / (i + 1)
    return 0.0

def ndcg_at_k(retrieved: list[str], relevance_scores: dict[str, int], k: int) -> float:
    """nDCG@K: normalized discounted cumulative gain."""
    # DCG
    dcg = 0.0
    for i, doc in enumerate(retrieved[:k]):
        rel = relevance_scores.get(doc, 0)
        dcg += rel / np.log2(i + 2)  # +2 because rank is 1-indexed

    # IDCG (ideal ranking)
    ideal_rels = sorted(relevance_scores.values(), reverse=True)[:k]
    idcg = sum(rel / np.log2(i + 2) for i, rel in enumerate(ideal_rels))

    return dcg / idcg if idcg > 0 else 0.0

# --- Evaluate a RAG retrieval system ---
def evaluate_retrieval(
    queries: list[str],
    retrieved_per_query: list[list[str]],
    relevant_per_query: list[set[str]],
    k: int = 5,
) -> dict:
    """Evaluate retrieval quality across multiple queries."""
    metrics = {"precision": [], "recall": [], "hit_rate": [], "mrr": []}

    for retrieved, relevant in zip(retrieved_per_query, relevant_per_query):
        metrics["precision"].append(precision_at_k(retrieved, relevant, k))
        metrics["recall"].append(recall_at_k(retrieved, relevant, k))
        metrics["hit_rate"].append(hit_rate(retrieved, relevant, k))
        metrics["mrr"].append(mrr(retrieved, relevant))

    return {name: f"{np.mean(values):.3f}" for name, values in metrics.items()}

# Example usage
queries = ["What is attention?", "How does RLHF work?"]
retrieved = [
    ["doc_attention", "doc_cnn", "doc_bert", "doc_rnn", "doc_transformers"],
    ["doc_rlhf", "doc_ppo", "doc_dpo", "doc_sft", "doc_reward"],
]
relevant = [
    {"doc_attention", "doc_bert", "doc_transformers"},
    {"doc_rlhf", "doc_ppo", "doc_dpo"},
]

results = evaluate_retrieval(queries, retrieved, relevant, k=5)
print("Retrieval Quality:")
for metric, value in results.items():
    print(f"  {metric:>12}: {value}")

# Expected output:
#   precision: 0.600
#      recall: 1.000
#    hit_rate: 1.000
#         mrr: 1.000

◆ Quick Reference

RETRIEVAL QUALITY BENCHMARKS:

  Metric        | Poor   | Acceptable | Good      | Excellent
  ──────────────|────────|────────────|───────────|──────────
  Precision@5   | < 0.30 | 0.30-0.50  | 0.50-0.70 | > 0.70
  Recall@5      | < 0.40 | 0.40-0.60  | 0.60-0.80 | > 0.80
  Hit Rate@5    | < 0.60 | 0.60-0.80  | 0.80-0.90 | > 0.90
  MRR           | < 0.40 | 0.40-0.60  | 0.60-0.80 | > 0.80
  nDCG@5        | < 0.40 | 0.40-0.60  | 0.60-0.80 | > 0.80

IF RETRIEVAL IS BAD, FIX THIS FIRST:
  1. Check embedding model quality
  2. Check chunking strategy (too big / too small?)
  3. Add metadata filtering
  4. Try hybrid search (semantic + keyword; see the fusion sketch below)
  5. Consider reranking (cross-encoder)
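
For step 4 above, a common way to merge a semantic ranking with a keyword (BM25-style) ranking is reciprocal rank fusion (RRF). This is a minimal illustrative sketch, not part of the evaluation suite; the document IDs are made up and k=60 is the conventional RRF smoothing constant.

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each doc scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from a dense (semantic) retriever and a keyword (BM25) retriever
semantic_ranking = ["doc_attention", "doc_bert", "doc_rnn"]
keyword_ranking  = ["doc_attention", "doc_transformers", "doc_cnn"]
print(reciprocal_rank_fusion([semantic_ranking, keyword_ranking]))
# doc_attention comes first because it ranks highly in both lists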

◆ Production Failure Modes

Failure: Low recall
  Symptoms:   Correct answer exists in corpus but model says "I don't know"
  Root cause: Relevant chunks not retrieved (embedding mismatch, wrong chunk size)
  Mitigation: Improve embedding model, tune chunk size, add keyword search

Failure: Low precision
  Symptoms:   Model hallucinates, using irrelevant retrieved chunks
  Root cause: Too many irrelevant chunks retrieved
  Mitigation: Add reranking step, reduce k, tighten similarity threshold

Failure: Evaluation set drift
  Symptoms:   Metrics look good but users complain
  Root cause: Eval set doesn't represent real user queries
  Mitigation: Regularly update eval set from production query logs
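
A concrete mitigation for the low-precision failure mode is a cross-encoder reranking pass over the retrieved candidates before they reach the LLM. The sketch below uses the sentence-transformers CrossEncoder class; the model name and passages are illustrative choices, and the model you pick should match your latency budget.

# pip install sentence-transformers
from sentence_transformers import CrossEncoder

# Illustrative model; any cross-encoder trained for passage ranking works
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What is attention in transformers?"
candidates = [
    "Attention mechanism explained",
    "CNN architectures",
    "Self-attention in BERT",
]

# Score each (query, passage) pair, then keep only the top-scoring chunks
scores = reranker.predict([(query, passage) for passage in candidates])
reranked = [passage for _, passage in sorted(zip(scores, candidates), reverse=True)]
print(reranked[:2])  # pass only the best chunks to the LLM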

○ Interview Angles

  • Q: How would you evaluate a RAG system?
  • A: I'd evaluate in two stages. Stage 1: Retrieval quality — I'd create a labeled dataset of 100+ queries with known relevant documents, then measure Precision@5, Recall@5, MRR, and nDCG. If retrieval metrics are poor (< 0.5 precision), I'd fix retrieval before touching the LLM. Stage 2: End-to-end — using RAGAS metrics (context relevance, faithfulness, answer correctness) to evaluate the full pipeline. I'd run this on every prompt/model change as a regression check. For production, I'd add online metrics: user feedback (thumbs up/down), citation click-through rate, and "I don't know" rate.

◆ Hands-On Exercises

Exercise 1: Build a Retrieval Eval Suite

Goal: Evaluate your RAG retrieval pipeline with standard metrics
Time: 45 minutes
Steps:
  1. Create 20 test queries with labeled relevant documents
  2. Run your retrieval system on all 20 queries
  3. Calculate Precision@5, Recall@5, MRR, and Hit Rate using the code above
  4. Identify the 5 worst queries — what went wrong with retrieval?
  5. Try one improvement (better chunking, reranking, hybrid search) and re-measure
Expected Output: Before/after metrics table, analysis of failure patterns


★ Connections

Relationship | Topics
─────────────|─────────────────────────────────────────────────────────────────────
Builds on    | RAG, Vector Databases, Embeddings
Leads to     | RAG optimization, search quality improvement, production monitoring
Compare with | LLM Evaluation (generation vs retrieval evaluation)
Cross-domain | Information retrieval, search engineering, library science

Type        | Resource                                            | Why
────────────|─────────────────────────────────────────────────────|──────────────────────────────────────────────────────
🔧 Hands-on | RAGAS Documentation                                 | Production RAG evaluation framework
📄 Paper    | Barnett et al. "Seven Failure Points of RAG" (2024) | Systematic analysis of where RAG systems fail
📘 Book     | "AI Engineering" by Chip Huyen (2025), Ch 4         | Retrieval evaluation as part of the RAG quality stack
🔧 Hands-on | DeepEval                                            | Open-source LLM + retrieval evaluation framework

★ Sources

  • Barnett et al. "Seven Failure Points When Engineering a RAG System" (2024)
  • RAGAS Documentation — https://docs.ragas.io/
  • Manning et al. "Introduction to Information Retrieval" (2008), Ch 8