Hallucination Detection & Mitigation

Hallucination is not an incidental bug. It is what happens when a probabilistic generator sounds certain without being sufficiently grounded.


★ TL;DR

  • What: Methods to detect and reduce unsupported, fabricated, or overconfident model outputs
  • Why: Hallucination is one of the main blockers to production trust in GenAI systems
  • Key point: The best fix is usually system-level grounding and verification, not just a better prompt

★ Overview

Definition

A hallucination is an output that is fluent and plausible but unsupported by the available evidence, tool results, or real-world facts.

Scope

This note covers hallucination types, detection methods, mitigation strategies, and production patterns. For broader safety context, see Ethics, Safety & Alignment. For retrieval-based grounding, see Retrieval-Augmented Generation (RAG).

Why It Happens

  • Next-token prediction optimizes plausibility, not truth
  • The model may be missing the needed knowledge
  • Prompts can force answers even when the model should abstain
  • Tool or retrieval context may be incomplete or noisy

Prerequisites

Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), LLM Evaluation & Benchmarks

★ Deep Dive

Useful Taxonomy

Type | Description | Example
Intrinsic | Contradicts provided context | Model cites a value not present in the retrieved chunk
Extrinsic | Sounds factual but is false outside the context | Fake company, paper, package, or legal clause
Fabricated citation | Invented source or quote | Nonexistent article or benchmark
Reasoning drift | Early steps are plausible, final conclusion is unsupported | Tool traces do not justify the final recommendation

Detection Strategies

1. Reference-based checks

Use when you have ground truth, citations, or authoritative context.

  • exact match or overlap for structured answers
  • contradiction / entailment checks
  • citation coverage checks
  • groundedness scores against retrieved passages
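
One common way to implement the contradiction / entailment check above is an off-the-shelf NLI model: treat the retrieved passage as the premise and the answer (or each answer sentence) as the hypothesis. A minimal sketch, assuming the Hugging Face transformers library and the public roberta-large-mnli checkpoint; the 0.5 cutoff is illustrative.

# pip install transformers torch
# ⚠️ Illustrative sketch; any NLI checkpoint with entailment/contradiction labels works
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

nli_name = "roberta-large-mnli"
nli_tokenizer = AutoTokenizer.from_pretrained(nli_name)
nli_model = AutoModelForSequenceClassification.from_pretrained(nli_name)

def entailment_check(claim: str, evidence: str) -> dict:
    """Score whether the evidence (premise) entails the claim (hypothesis)."""
    inputs = nli_tokenizer(evidence, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(nli_model(**inputs).logits, dim=-1)[0]
    scores = {nli_model.config.id2label[i].lower(): p.item() for i, p in enumerate(probs)}
    return {"scores": scores, "supported": scores.get("entailment", 0.0) > 0.5}

print(entailment_check(
    "The Eiffel Tower was built in 1890.",
    "The Eiffel Tower was built in 1889 and stands 330 meters tall.",
))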

2. Reference-free checks

Use when no gold answer exists.

  • self-consistency across repeated generations
  • verifier model or judge model
  • uncertainty estimation or abstention scoring
  • anomaly detection on tool trajectories
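
Uncertainty estimation can be approximated from token log probabilities when the API exposes them. A minimal sketch using the logprobs option of the OpenAI Chat Completions API; the -1.0 cutoff is an arbitrary placeholder that should be calibrated per task, and a high average log probability does not by itself prove correctness.

# ⚠️ Illustrative sketch | Requires: openai>=1.x, OPENAI_API_KEY
from openai import OpenAI
import math

client = OpenAI()

def logprob_confidence(question: str, model: str = "gpt-4o-mini") -> dict:
    """Use mean token log probability as a rough confidence / abstention signal."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        logprobs=True,
        max_tokens=100,
    )
    logprobs = [t.logprob for t in resp.choices[0].logprobs.content]
    avg = sum(logprobs) / len(logprobs)
    return {
        "answer": resp.choices[0].message.content,
        "avg_logprob": avg,
        "perplexity": math.exp(-avg),
        "low_confidence": avg < -1.0,  # placeholder cutoff: calibrate on labeled data
    }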

3. Production heuristics

  • response contains named entities not present in context
  • output references tools that were never called
  • answer format is valid but evidence fields are empty
  • confidence tone is high while retrieval confidence is low
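
These heuristics are cheap enough to run on every response before any LLM-based check. A minimal sketch; the capitalized-token entity extraction and the flag names are simplifications for illustration.

import re

def heuristic_flags(answer: str, context: str,
                    called_tools: set, referenced_tools: set) -> list:
    """Rule-based hallucination signals suitable for production monitoring."""
    flags = []
    # Entity-like tokens in the answer that never appear in the retrieved context
    candidates = set(re.findall(r"\b[A-Z][A-Za-z0-9_-]{2,}\b", answer))
    missing = sorted(e for e in candidates if e.lower() not in context.lower())
    if missing:
        flags.append(f"entities_not_in_context: {missing}")
    # Tools mentioned in the answer that were never actually called
    uncalled = sorted(referenced_tools - called_tools)
    if uncalled:
        flags.append(f"uncalled_tools: {uncalled}")
    # Confident answer produced on top of empty retrieval
    if answer.strip() and not context.strip():
        flags.append("answer_without_retrieval")
    return flags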

Practical Detection Stack

User request
-> retrieve / call tools
-> generate answer with citations
-> groundedness check
-> format + policy validation
-> optional verifier model
-> answer / abstain / escalate
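
Expanded as code, the same stack might look like the sketch below; retrieve, generate_with_citations, groundedness_score, validate_format, verifier_approves, is_high_risk, and escalate are placeholders for your own components.

# ⚠️ Sketch only: every lower-level function here is a placeholder
def handle_request(query: str) -> dict:
    chunks = retrieve(query)                                 # retrieve / call tools
    draft = generate_with_citations(query, chunks)           # generate answer with citations
    grounded = groundedness_score(draft, chunks)             # groundedness check
    if not validate_format(draft):                           # format + policy validation
        return {"status": "escalate", "reason": "invalid_format"}
    if grounded < 0.7 and not verifier_approves(draft, chunks):  # optional verifier model
        if is_high_risk(query):
            return escalate(query, draft)                    # human review
        return {"status": "abstain",
                "message": "Not enough grounded evidence to answer reliably."}
    return {"status": "ok", "answer": draft, "groundedness": grounded}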

Mitigation Strategies

Strategy | Best For | Limitation
RAG with citations | Knowledge assistants | Depends on retrieval quality
Tool use | Dynamic facts and calculations | Tool outputs can still be misused
Structured output | Workflows and APIs | Does not guarantee truthfulness
Abstain / say "I do not know" | High-risk domains | Can reduce answer coverage
Fine-tuning on domain style | Format and task consistency | Does not guarantee current facts
Verifier pass | Expensive or high-stakes requests | Adds latency and cost
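
Several of these mitigations start at the prompt: instruct the model to answer only from provided evidence, cite chunk ids, and abstain when the evidence is insufficient. A minimal sketch; the wording and field layout are illustrative, not a standard template.

GROUNDED_ANSWER_PROMPT = """\
Answer the question using ONLY the evidence below.
- Cite the chunk ids you used, e.g. [1], [3].
- If the evidence does not contain the answer, reply exactly: "I do not know."

EVIDENCE:
{evidence}

QUESTION:
{question}
"""

def build_grounded_prompt(question: str, chunks: list) -> str:
    evidence = "\n".join(f"[{i+1}] {c}" for i, c in enumerate(chunks))
    return GROUNDED_ANSWER_PROMPT.format(evidence=evidence, question=question)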

What Works Best In Practice

  1. Ground with retrieval or tools
  2. Ask for citations or evidence fields
  3. Add a post-generation groundedness check
  4. Route uncertain or unsafe answers to abstention or human review

Simple Groundedness Pattern

# ⚠️ Last tested: 2026-04
def answer_with_check(query, context_chunks, llm, verifier):
    draft = llm.generate(query=query, context=context_chunks)
    verdict = verifier.score(answer=draft, evidence=context_chunks)
    if verdict["grounded"] < 0.7:
        return {
            "status": "abstain",
            "message": "I do not have enough grounded evidence to answer reliably."
        }
    return {"status": "ok", "answer": draft}

Design Guidance

  • If the answer must be up to date, use tools or retrieval
  • If the answer must be auditable, require citations
  • If the domain is high risk, allow abstention and human escalation
  • If the task is repetitive and format-heavy, add structured outputs and regression tests
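
For format-heavy tasks, structured outputs make the evidence requirement machine-checkable: an empty citations field becomes a hallucination signal in its own right. A minimal sketch using Pydantic to validate the model's JSON output; the schema fields are illustrative.

# pip install pydantic>=2
from pydantic import BaseModel, Field

class GroundedAnswer(BaseModel):
    answer: str
    citations: list[int] = Field(default_factory=list)   # ids of chunks used as evidence
    confidence: float = Field(ge=0.0, le=1.0)

def parse_and_screen(raw_json: str) -> dict:
    """Validate the model's JSON output and flag answers with no cited evidence."""
    parsed = GroundedAnswer.model_validate_json(raw_json)
    if not parsed.citations:
        return {"status": "abstain", "reason": "no_evidence_cited"}
    return {"status": "ok", "answer": parsed.answer, "citations": parsed.citations}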

◆ Quick Reference

Signal | Interpretation
High fluency + low evidence coverage | Likely hallucination risk
Wrong citations | Retrieval or citation assembly bug
Repeated invented entities | Model prior overpowering context
Strong answer on empty retrieval | Missing abstention policy

○ Gotchas & Common Mistakes

  • Lowering the temperature is not a complete hallucination mitigation strategy
  • Fine-tuning can improve style while still preserving factual failure modes
  • A judge model can also hallucinate if it is not grounded on evidence
  • Hallucination should be measured per use case, not only as a generic score

○ Interview Angles

  • Q: What is the most effective way to reduce hallucination in enterprise assistants?
  • A: Ground the answer on retrieval or tool outputs, require evidence in the response path, and add a post-generation verification step with abstention when confidence is low.

  • Q: How do detection and mitigation differ?
  • A: Detection estimates whether an answer is unsupported. Mitigation changes the system so unsupported answers happen less often or are blocked before the user sees them.

★ Code & Implementation

Production Groundedness Checker with Abstention

# pip install openai>=1.60
# ⚠️ Last tested: 2026-04 | Requires: openai>=1.60, OPENAI_API_KEY env var
from openai import OpenAI
import json

client = OpenAI()

def check_groundedness(answer: str, context_chunks: list[str], threshold: float = 0.7) -> dict:
    """Ask a judge LLM whether an answer is grounded in the given context."""
    context = "\n\n".join(f"[Chunk {i+1}]: {c}" for i, c in enumerate(context_chunks))
    prompt = (
        "You are a factuality judge. Score whether the ANSWER is supported by the CONTEXT.\n\n"
        f"CONTEXT:\n{context}\n\nANSWER:\n{answer}\n\n"
        'JSON only: {"score": 0.0-1.0, "reason": "...", "unsupported_claims": ["..."]}'
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    result = json.loads(resp.choices[0].message.content)
    result["grounded"] = result["score"] >= threshold
    return result

# Wrong year - should be flagged
context = ["The Eiffel Tower was built in 1889 and stands 330 meters tall."]
answer  = "The Eiffel Tower was built in 1890."
check   = check_groundedness(answer, context)
print(f"Grounded: {check['grounded']} | Score: {check['score']:.2f}")
print(f"Unsupported: {check.get('unsupported_claims', [])}")

Self-Consistency Check (Reference-Free)

# ⚠️ Last tested: 2026-04 | Requires: openai>=1.60, OPENAI_API_KEY
# Reuses the OpenAI `client` created in the previous snippet
from collections import Counter

def self_consistency_check(question: str, n: int = 5) -> dict:
    answers = [
        client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": question}],
            temperature=0.8, max_tokens=80,
        ).choices[0].message.content.strip()
        for _ in range(n)
    ]
    # Exact-string grouping is naive; normalize answers (case, punctuation) for better matching
    top, freq = Counter(answers).most_common(1)[0]
    return {"answer": top, "consistency": freq / n, "confident": freq / n >= 0.6}

r = self_consistency_check("What year was the Eiffel Tower built?")
print(f"{r['answer']} ({r['consistency']:.0%} agreement, confident={r['confident']})")

RAGAS Faithfulness Score (LLM-as-Judge for RAG Groundedness)

# pip install ragas>=0.2 datasets openai>=1.60
# ⚠️ Last tested: 2026-04 | Requires: ragas>=0.2, OPENAI_API_KEY

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from datasets import Dataset

eval_data = Dataset.from_dict({
    "question": [
        "When was the Eiffel Tower built?",
        "What is the capital of Germany?",
    ],
    "answer": [
        "The Eiffel Tower was built in 1890.",    # Hallucinated (should be 1889)
        "The capital of Germany is Berlin.",       # Correct
    ],
    "contexts": [
        ["The Eiffel Tower was built in 1889 and stands 330 meters tall."],
        ["Berlin is the capital and largest city of Germany."],
    ],
    "ground_truth": [
        "The Eiffel Tower was built in 1889.",
        "Berlin is the capital of Germany.",
    ],
})

result = evaluate(dataset=eval_data, metrics=[faithfulness, answer_relevancy])
print(result)
# faithfulness ~0.5 (the 1890 claim is penalized), answer_relevancy ~0.95
# result.to_pandas() shows per-row breakdown

★ Connections

Relationship | Topics
Builds on | Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), LLM Evaluation & Benchmarks
Leads to | Ethics, Safety & Alignment, Advanced Fine-Tuning for LLM Adaptation, LLM Evaluation & Benchmarks
Compare with | Generic model quality issues, prompt injection, data leakage
Cross-domain | Information retrieval, fact checking, uncertainty estimation

◆ Hands-On Exercises

Exercise 1: Build a Hallucination Detection Pipeline

Goal: Create a multi-method hallucination detection system
Time: 45 minutes
Steps:
  1. Generate 20 LLM responses on factual questions
  2. Implement 3 detection methods: self-consistency, retrieval grounding, NLI
  3. Label each response as factual/hallucinated using all 3 methods
  4. Compare detection accuracy against human annotations
Expected Output: Detection method comparison table with precision/recall
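
A possible starting skeleton for step 4, assuming each detection method has been wrapped as a callable that returns 1 for "hallucinated" and 0 for "factual"; the function and method names are placeholders.

# pip install scikit-learn
from sklearn.metrics import precision_score, recall_score

def compare_methods(responses: list, human_labels: list, methods: dict) -> None:
    """methods maps a name to a callable: response -> 1 (hallucinated) or 0 (factual)."""
    for name, detect in methods.items():
        preds = [detect(r) for r in responses]
        p = precision_score(human_labels, preds, zero_division=0)
        r = recall_score(human_labels, preds, zero_division=0)
        print(f"{name:<22} precision={p:.2f} recall={r:.2f}")

# compare_methods(responses, human_labels,
#     {"self_consistency": ..., "retrieval_grounding": ..., "nli": ...})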


◆ Production Failure Modes

Failure | Symptoms | Root Cause | Mitigation
False positive refusals | System flags accurate responses as hallucinations | Detection threshold too aggressive | Calibrate thresholds on domain data, multi-method consensus
Confident hallucinations | Model hallucinates with high confidence scores | Confidence ≠ correctness for LLMs | Retrieval grounding, self-consistency checks, citation verification
Detection latency | Real-time hallucination check adds 2-5s per response | Detection method too compute-intensive | Lightweight pre-filter, async verification, batch checking
Judge model hallucination | LLM-as-judge marks correct answers as hallucinated | Judge model itself is not grounded | Provide evidence to the judge; ensemble multiple judges
Domain drift | Detection accuracy degrades on new document types | Detector calibrated on a different corpus | Domain-specific threshold calibration, periodic re-evaluation
Self-consistency collapse | All N samples agree on the same systematically wrong answer | Model has a systematic bias on this fact | Combine with retrieval grounding; never rely on consistency alone
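
For the false positive refusals row, calibration can be as simple as sweeping cutoffs over a labeled domain sample and keeping the one that maximizes F1. A minimal sketch; it assumes groundedness scores in [0, 1] where lower means more likely hallucinated.

# pip install scikit-learn numpy
import numpy as np
from sklearn.metrics import f1_score

def calibrate_threshold(scores: list, labels: list) -> float:
    """Pick the groundedness cutoff that best separates hallucinated (1) from grounded (0)."""
    best_t, best_f1 = 0.5, -1.0
    for t in np.linspace(0.0, 1.0, 101):
        preds = [1 if s < t else 0 for s in scores]   # below the cutoff -> flag as hallucinated
        f1 = f1_score(labels, preds, zero_division=0)
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t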

★ Sources

Type | Resource | Why
📄 Paper | Min et al., "FActScore" (2023) | Fine-grained factuality scoring for LLM outputs
📘 Book | "AI Engineering" by Chip Huyen (2025), Ch. 4 | Hallucination detection as part of evaluation strategy
🔧 Hands-on | Vectara HHEM | Open-source hallucination evaluation model