✨ Bit: If your RAG system retrieves the wrong documents, no amount of prompt engineering will fix the answer. Retrieval evaluation measures whether the right information reaches the model — the most neglected and most impactful part of RAG quality.
What: Metrics and methods for measuring retrieval quality in RAG systems — precision, recall, MRR, nDCG, and end-to-end RAG evaluation
Why: Most RAG failures are retrieval failures. If irrelevant chunks reach the model, it hallucinates or refuses. Measuring retrieval quality separately from generation quality is essential.
Key point: Evaluate retrieval independently from generation. A perfect LLM can't help if retrieval gives it the wrong documents.
Retrieval evaluation measures how effectively a retrieval system finds and ranks relevant documents for a given query. In RAG, this means measuring whether the chunks that reach the LLM actually contain the information needed to answer the question.
```python
# pip install numpy>=1.24
# ⚠️ Last tested: 2026-04 | No external API needed
import numpy as np


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Precision@K: fraction of the top-K results that are relevant."""
    top_k = retrieved[:k]
    return len(set(top_k) & relevant) / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Recall@K: fraction of all relevant docs found in the top-K."""
    if not relevant:
        return 0.0
    top_k = retrieved[:k]
    return len(set(top_k) & relevant) / len(relevant)


def hit_rate(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Hit Rate: 1 if any relevant doc appears in the top-K, else 0."""
    return 1.0 if set(retrieved[:k]) & relevant else 0.0


def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal Rank: 1/rank of the first relevant result (averaged over queries to give MRR)."""
    for i, doc in enumerate(retrieved):
        if doc in relevant:
            return 1.0 / (i + 1)
    return 0.0


def ndcg_at_k(retrieved: list[str], relevance_scores: dict[str, int], k: int) -> float:
    """nDCG@K: normalized discounted cumulative gain with graded relevance labels."""
    # DCG of the actual ranking
    dcg = 0.0
    for i, doc in enumerate(retrieved[:k]):
        rel = relevance_scores.get(doc, 0)
        dcg += rel / np.log2(i + 2)  # +2 because rank is 1-indexed
    # IDCG: DCG of the ideal ranking (labels sorted by relevance)
    ideal_rels = sorted(relevance_scores.values(), reverse=True)[:k]
    idcg = sum(rel / np.log2(i + 2) for i, rel in enumerate(ideal_rels))
    return dcg / idcg if idcg > 0 else 0.0


# --- Evaluate a RAG retrieval system ---
def evaluate_retrieval(
    queries: list[str],
    retrieved_per_query: list[list[str]],
    relevant_per_query: list[set[str]],
    k: int = 5,
) -> dict:
    """Evaluate retrieval quality averaged across multiple queries."""
    metrics = {"precision": [], "recall": [], "hit_rate": [], "mrr": []}
    for retrieved, relevant in zip(retrieved_per_query, relevant_per_query):
        metrics["precision"].append(precision_at_k(retrieved, relevant, k))
        metrics["recall"].append(recall_at_k(retrieved, relevant, k))
        metrics["hit_rate"].append(hit_rate(retrieved, relevant, k))
        metrics["mrr"].append(mrr(retrieved, relevant))
    return {name: f"{np.mean(values):.3f}" for name, values in metrics.items()}


# Example usage
queries = ["What is attention?", "How does RLHF work?"]
retrieved = [
    ["doc_attention", "doc_cnn", "doc_bert", "doc_rnn", "doc_transformers"],
    ["doc_rlhf", "doc_ppo", "doc_dpo", "doc_sft", "doc_reward"],
]
relevant = [
    {"doc_attention", "doc_bert", "doc_transformers"},
    {"doc_rlhf", "doc_ppo", "doc_dpo"},
]
results = evaluate_retrieval(queries, retrieved, relevant, k=5)
print("Retrieval Quality:")
for metric, value in results.items():
    print(f" {metric:>12}: {value}")

# Expected output:
#    precision: 0.600
#       recall: 1.000
#     hit_rate: 1.000
#          mrr: 1.000
```
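The harness above only reports binary-relevance metrics; `ndcg_at_k` needs graded labels. A minimal sketch of how it could be called, using hypothetical relevance grades (0–3) for the first example query — the grades are illustrative, not from a real dataset:

```python
# Hypothetical graded relevance labels for the first query (0 = irrelevant, 3 = highly relevant).
relevance_scores = {
    "doc_attention": 3,
    "doc_transformers": 2,
    "doc_bert": 1,
    # any doc not listed defaults to 0
}

score = ndcg_at_k(retrieved[0], relevance_scores, k=5)
print(f"nDCG@5 for query 1: {score:.3f}")  # ≈ 0.897 with these grades
```

nDCG rewards putting the most relevant documents at the top, so it is the metric to watch when you add a reranker.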
Q: How would you evaluate a RAG system?
A: I'd evaluate in two stages. Stage 1: Retrieval quality — I'd create a labeled dataset of 100+ queries with known relevant documents, then measure Precision@5, Recall@5, MRR, and nDCG. If retrieval metrics are poor (e.g., Precision@5 below 0.5), I'd fix retrieval before touching the LLM. Stage 2: End-to-end — I'd use RAGAS metrics (context relevance, faithfulness, answer correctness) to evaluate the full pipeline, and run them on every prompt or model change as a regression check. For production, I'd add online metrics: user feedback (thumbs up/down), citation click-through rate, and the "I don't know" rate.
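One way the stage-1 regression check could be wired up, reusing `evaluate_retrieval` from the block above — a sketch only, where `search_fn` is a placeholder for your retriever and the thresholds are assumptions to tune per project, not fixed standards:

```python
from typing import Callable


def retrieval_gate(
    search_fn: Callable[[str], list[str]],   # placeholder: query -> ranked list of doc IDs
    test_queries: list[str],
    relevant_per_query: list[set[str]],
    k: int = 5,
    min_precision: float = 0.5,              # "fix retrieval first" threshold from the answer above
    min_recall: float = 0.7,                 # assumed threshold, adjust to your data
) -> bool:
    """Return True if retrieval metrics clear the thresholds; run on every retriever change."""
    retrieved = [search_fn(q) for q in test_queries]
    results = evaluate_retrieval(test_queries, retrieved, relevant_per_query, k=k)
    ok = float(results["precision"]) >= min_precision and float(results["recall"]) >= min_recall
    print(("PASS" if ok else "FAIL"), results)
    return ok
```

Gating on retrieval metrics first keeps LLM cost out of the inner loop: the end-to-end (RAGAS) pass only runs once retrieval clears the bar.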
Goal: Evaluate your RAG retrieval pipeline with standard metrics
Time: 45 minutes
Steps:
1. Create 20 test queries with labeled relevant documents (a test-set sketch follows this exercise)
2. Run your retrieval system on all 20 queries
3. Calculate Precision@5, Recall@5, MRR, and Hit Rate using the code above
4. Identify the 5 worst queries — what went wrong with retrieval?
5. Try one improvement (better chunking, reranking, hybrid search) and re-measure
Expected Output: Before/after metrics table, analysis of failure patterns
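For steps 1 and 5, a minimal sketch of what the labeled test set and the before/after comparison might look like — the file name, field names, and retriever callables are placeholders, and `evaluate_retrieval` comes from the code block above:

```python
import json
from typing import Callable

# Hypothetical test-set format for step 1 (one entry per query; field names are assumptions).
test_set = [
    {"query": "What is attention?", "relevant_doc_ids": ["doc_attention", "doc_transformers"]},
    {"query": "How does RLHF work?", "relevant_doc_ids": ["doc_rlhf", "doc_ppo"]},
    # ... 18 more queries
]
with open("retrieval_testset.json", "w") as f:
    json.dump(test_set, f, indent=2)


def compare_retrievers(retrievers: dict[str, Callable[[str], list[str]]], k: int = 5) -> None:
    """Step 5: run the same test set against each retriever (e.g. baseline vs. reranked)."""
    queries = [item["query"] for item in test_set]
    relevant = [set(item["relevant_doc_ids"]) for item in test_set]
    for name, search_fn in retrievers.items():
        retrieved = [search_fn(q) for q in queries]
        print(f"{name}: {evaluate_retrieval(queries, retrieved, relevant, k=k)}")
```

Printing both configurations from the same test set gives you the before/after table directly; the per-query scores (before averaging) are where the failure-pattern analysis for step 4 comes from.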