✨ Bit: "You can't improve what you can't measure." In GenAI, the problem is the opposite — you CAN measure, but the benchmarks keep getting saturated. It's an arms race between models and tests.
LLM Evaluation encompasses the tools, benchmarks, metrics, and methodologies used to assess language model performance across key dimensions: accuracy, reasoning, coding, safety, fairness, hallucination, and real-world utility.
Covers: Major benchmarks, RAG-specific evaluation, evaluation tools, and emerging approaches. For applied evaluation design, see LLM Evaluation Deep Dive. For interview-focused architecture framing, see System Design for AI Interviews. This is a fast-moving area - benchmarks get saturated and replaced regularly.
┌───────────────────────────────────────────────────────┐
│ WHAT TO MEASURE                                       │
│                                                       │
│ 1. ACCURACY & KNOWLEDGE → Does it know things?        │
│ 2. REASONING            → Can it think logically?     │
│ 3. CODING               → Can it write code?          │
│ 4. SAFETY & HARM        → Is it safe to deploy?       │
│ 5. FAIRNESS & BIAS      → Is it equitable?            │
│ 6. ROBUSTNESS           → Does it handle edge cases?  │
│ 7. EFFICIENCY           → Is it fast & cheap enough?  │
└───────────────────────────────────────────────────────┘
Chatbot Arena is the gold standard for real-world model comparison: users chat with two anonymous models side by side and vote on which response is better.
HOW CHATBOT ARENA WORKS:
1. User visits arena, types a prompt
2. Two ANONYMOUS models respond (user doesn't know which)
3. User votes: Model A wins / Model B wins / Tie
4. Elo ratings updated (like chess rankings; see the update sketch after this list)
5. After enough votes, model identity revealed
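Step 4 is the standard Elo update from chess. A minimal sketch of the arithmetic, assuming a fixed K-factor of 32 and illustrative ratings (the live leaderboard actually fits a Bradley-Terry style model over all votes rather than updating sequentially):

def expected_score(r_a, r_b):
    # Probability that model A beats model B under the Elo model
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a, r_b, outcome_a, k=32):
    # outcome_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie
    delta = k * (outcome_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta

# Illustrative: a 1350-rated model beats a 1330-rated one and gains ~15 points
print(update_elo(1350, 1330, 1.0))  # (~1365.1, ~1314.9)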
WHY IT'S THE GOLD STANDARD:
✓ Contamination-resistant (prompts are user-generated, dynamic)
✓ Real-world signal (actual users, not curated test sets)
✓ Covers open-ended quality (creativity, helpfulness, nuance)
✗ Slow to converge (needs 1000s of votes for stable rankings)
✗ Only tests conversational ability (not coding, math, safety)
ELO RANKINGS (April 2026 — illustrative):
GPT-5.4: ~1350
Claude Opus 4.6: ~1345
Gemini 3.1 Pro: ~1330
GPT-5.4 mini: ~1280
LLaMA 4 Maverick: ~1260
Contamination = benchmark data leaked into training data, inflating scores.
| Detection Method | How It Works | Effectiveness |
|---|---|---|
| Canary questions | Insert unique, never-published questions into the eval set | High — if the model "knows" the answer, it's contaminated |
| Temporal splits | Use questions created AFTER the model's training cutoff | High — the model can't have seen them |
| N-gram overlap | Check training data for exact benchmark question matches | Medium — misses paraphrases |
| Rephrasing attacks | Rephrase benchmark questions; check if the model's scores drop | High — contaminated models overfit to exact wording |
| Membership inference | Statistical test: does the model "remember" specific examples? | Medium — requires calibration |
Best practice: Always supplement static benchmarks with dynamic evaluation (LiveBench, Chatbot Arena, or your own held-out eval set refreshed monthly).
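As a rough illustration of the n-gram overlap method from the table above, here is a hedged sketch; the function names and the 8-token window are assumptions, and a real pipeline would stream over the full tokenized training corpus rather than a single string:

def ngrams(text, n=8):
    # Set of lowercase n-token sequences in the text
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated_fraction(benchmark_questions, training_corpus, n=8):
    # Fraction of benchmark questions sharing at least one n-gram with the corpus
    corpus_ngrams = ngrams(training_corpus, n)
    flagged = [q for q in benchmark_questions if ngrams(q, n) & corpus_ngrams]
    return len(flagged) / len(benchmark_questions)

# Hypothetical data: the first question appears verbatim in the corpus, the second does not
corpus = "intro text the quick brown fox jumps over the lazy dog near the river bank today end"
questions = [
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "what year did bell labs researchers first demonstrate the point contact transistor device",
]
print(contaminated_fraction(questions, corpus))  # 0.5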
# ⚠️ Last tested: 2026-04

# ═══ RAGAS: Evaluate a RAG Pipeline ═══
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

try:
    result = evaluate(
        dataset=your_eval_dataset,  # Questions + retrieved contexts + answers + ground truth
        metrics=[faithfulness, answer_relevancy, context_precision],
    )
    print(result)
    # {'faithfulness': 0.87, 'answer_relevancy': 0.92, 'context_precision': 0.78}
except Exception as e:
    print(f"Evaluation failed: {e}")  # Production: retry or fall back to a subset

# ═══ DEEPEVAL: Unit Tests for LLMs ═══
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import HallucinationMetric

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    context=["Paris is the capital city of France."],
)
metric = HallucinationMetric(threshold=0.5)
assert_test(test_case, [metric])  # Passes if hallucination score < 0.5
WHICH BENCHMARK FOR WHAT:
General knowledge → MMLU-Pro (MMLU is too easy now)
Coding → SWE-bench (HumanEval is too easy now)
Reasoning → GPQA-Diamond, ARC-AGI-2
Math → MATH-500, AIME
Real-world → LiveBench (dynamic, uncontaminated)
RAG quality → RAGAS metrics
Safety → Red teaming + automated harm benchmarks
MINIMUM EVAL STACK:
1. RAGAS (if building RAG)
2. A handful of golden test cases (manual)
3. LLM-as-judge for subjective quality (see the judge sketch after this list)
4. Prod monitoring (LangSmith / Phoenix)
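For item 3, a minimal LLM-as-judge sketch, assuming the OpenAI Python SDK; the judge prompt, 1-5 scale, and "gpt-4o-mini" model name are placeholders to adapt to your own stack:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "Rate the answer from 1 (poor) to 5 (excellent) for helpfulness and factual grounding.\n"
    "Question: {question}\nAnswer: {answer}\n"
    "Respond with a single integer only."
)

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> int:
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    return int(resp.choices[0].message.content.strip())  # may raise if the judge replies with more than a bare integer

print(judge("What is the capital of France?", "The capital of France is Paris."))

In practice, calibrate the judge against a handful of human-labeled examples before trusting its scores.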
BENCHMARK SATURATION WARNING:
If top models score > 90%, the benchmark is no longer useful
for distinguishing frontier models. Look for harder alternatives.
Q: How do you evaluate a RAG system?
A: Evaluate component by component. Retrieval quality (context precision + recall) — are the right chunks found? Generation quality (faithfulness + answer relevancy) — is the answer grounded and on-topic? Use RAGAS for automated metrics, plus a golden test set of 50+ question-answer pairs with human-verified ground truth.
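A minimal sketch of wiring such a golden test set into RAGAS, assuming the 0.1-style API from the snippet above where evaluate() accepts a Hugging Face Dataset; the column names (question, answer, contexts, ground_truth) vary slightly between RAGAS versions, and the row below is illustrative:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Golden test set: human-verified question and reference answer (illustrative single row)
golden = {
    "question": ["What is the refund window?"],
    "ground_truth": ["Refunds are accepted within 30 days of purchase."],
    # Filled in by running your RAG pipeline over each question:
    "contexts": [["Our policy allows refunds within 30 days of purchase."]],
    "answer": ["You can request a refund within 30 days of purchase."],
}

result = evaluate(
    dataset=Dataset.from_dict(golden),
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)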
Q: Why are traditional benchmarks becoming less useful?
A: Saturation (top models all score >90%), contamination (benchmark data in training sets), and gap between benchmark performance and real-world utility. The field is moving to dynamic benchmarks (LiveBench), harder tests (SWE-bench, ARC-AGI-2), and domain-specific evaluation.
Goal: Create a domain-specific eval suite with automated and human metrics
Time: 45 minutes
Steps:
1. Collect 50 real user queries from your domain
2. Create gold-standard answers for each
3. Implement automated scoring (ROUGE, LLM-as-judge, exact match; see the scoring sketch after these steps)
4. Run 3 different models through the suite and rank them
Expected Output: Model comparison leaderboard with multiple metrics
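A minimal sketch of step 3's automated scoring and the step-4 ranking, using exact match plus a SQuAD-style token F1 as stand-ins (swap in a ROUGE library or the LLM-as-judge above as needed); the model names and outputs are hypothetical:

from collections import Counter

def normalize(text):
    return text.lower().strip().split()

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    # Harmonic mean of token precision and recall between prediction and gold answer
    p, g = Counter(normalize(pred)), Counter(normalize(gold))
    overlap = sum((p & g).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(p.values()), overlap / sum(g.values())
    return 2 * precision * recall / (precision + recall)

def score_model(outputs, golds):
    # Average both metrics over the eval suite for one model
    em = sum(exact_match(o, g) for o, g in zip(outputs, golds)) / len(golds)
    f1 = sum(token_f1(o, g) for o, g in zip(outputs, golds)) / len(golds)
    return {"exact_match": round(em, 3), "token_f1": round(f1, 3)}

# Hypothetical gold answers and model outputs (replace with your 50-query suite)
golds = ["Paris", "The Eiffel Tower is in Paris"]
model_outputs = {
    "model_a": ["Paris", "The Eiffel Tower is located in Paris"],
    "model_b": ["London", "It is in Paris"],
}

leaderboard = sorted(
    ((name, score_model(outs, golds)) for name, outs in model_outputs.items()),
    key=lambda row: row[1]["token_f1"],
    reverse=True,
)
for rank, (name, scores) in enumerate(leaderboard, 1):
    print(rank, name, scores)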