
Guardrails & Content Filtering

LLMs are powerful but unpredictable. Guardrails are the safety barriers that keep your AI from generating toxic content, leaking data, executing unauthorized actions, or hallucinating medical advice. They're not optional in production.


★ TL;DR

  • What: Input validation, output filtering, and behavioral constraints applied to LLM systems to ensure safety, compliance, and quality
  • Why: Without guardrails, LLMs can generate harmful content, leak PII, follow injected instructions, or produce outputs that violate regulations.
  • Key point: Guardrails operate at three layers — input (block bad requests), model (constrain behavior), and output (validate before delivery) — and must be fast enough not to blow the latency budget.

★ Overview

Definition

Guardrails are the programmatic checks, filters, and constraints applied before, during, and after LLM inference to ensure outputs are safe, accurate, compliant, and on-topic.

Scope

Covers: Input validation, output filtering, PII detection, topic boundaries, hallucination guards, structured output enforcement, and production implementation. For adversarial attacks, see Adversarial ML; for the security checklist, see OWASP Top 10.

Significance

  • Regulatory requirement: EU AI Act, HIPAA, and SOC 2 all require content controls
  • Brand safety: One toxic response can go viral and damage trust
  • Production necessity: Every production LLM system needs guardrails — the question is which ones

Prerequisites


★ Deep Dive

Three-Layer Guardrail Architecture

USER INPUT
┌──────────────────────────────────────────┐
│         LAYER 1: INPUT GUARDS            │
│                                          │
│  • Prompt injection detection            │
│  • PII detection & redaction             │
│  • Topic boundary check                  │
│  • Input length / cost limits            │
│  • Rate limiting                         │
│                                          │
│  ↓ BLOCKED → return rejection message    │
│  ↓ PASSED → continue to model           │
└──────────────────────────────────────────┘
┌──────────────────────────────────────────┐
│         LAYER 2: MODEL CONSTRAINTS       │
│                                          │
│  • System prompt with behavioral rules   │
│  • Temperature / token limits            │
│  • Structured output enforcement         │
│  • Tool call validation                  │
│                                          │
└──────────────────────────────────────────┘
┌──────────────────────────────────────────┐
│         LAYER 3: OUTPUT GUARDS           │
│                                          │
│  • Toxicity / hate speech classifier     │
│  • PII leakage detection                 │
│  • Hallucination check (if applicable)   │
│  • Schema validation                     │
│  • Competitor mention filter             │
│  • Citation verification                 │
│                                          │
│  ↓ FAILED → fallback response or retry   │
│  ↓ PASSED → return to user              │
└──────────────────────────────────────────┘
USER RESPONSE

Guardrail Types

Guardrail | Layer | What It Catches | Implementation
--- | --- | --- | ---
Prompt injection | Input | Attempts to override system instructions | Regex + classifier (see Adversarial ML)
PII detection | Input + Output | SSN, credit cards, emails, phone numbers | Regex + NER model (Presidio, spaCy)
Topic boundaries | Input | Off-topic requests (e.g., political opinions) | Classifier or system prompt enforcement
Toxicity filter | Output | Hate speech, violence, sexual content | OpenAI Moderation API, Perspective API
Hallucination guard | Output | Ungrounded claims, fabricated citations | Cross-reference with retrieved sources
Schema validation | Output | Malformed JSON, missing fields | Pydantic / JSON Schema validation (see sketch below)
Cost guard | Input | Excessive token usage, prompt injection via length | Token counting + budget enforcement
Tool call validation | Output | Unauthorized tool calls, dangerous parameters | Allowlist of tools + parameter validation
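
Schema validation is usually the cheapest output guard to add: parse the model's output against a Pydantic model and retry or fall back when it doesn't conform. A minimal sketch, assuming a hypothetical TicketSummary schema; the field names are illustrative, not from any particular API.

# Schema validation output guard (sketch; TicketSummary is a hypothetical schema)
from typing import Optional
from pydantic import BaseModel, ValidationError

class TicketSummary(BaseModel):
    category: str
    priority: int
    summary: str

def validate_structured_output(raw: str) -> Optional[TicketSummary]:
    """Return the parsed object, or None so the caller can retry or use a fallback."""
    try:
        return TicketSummary.model_validate_json(raw)
    except ValidationError:
        return None

# Example: a malformed model response is rejected, a conforming one is returned
assert validate_structured_output('{"category": "billing"}') is None
assert validate_structured_output('{"category": "billing", "priority": 2, "summary": "Refund request"}')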

★ Code & Implementation

Production Guardrails Pipeline

# pip install openai>=1.0 pydantic>=2.0
# ⚠️ Last tested: 2026-04 | Requires: openai>=1.0

import re
from openai import OpenAI
from pydantic import BaseModel, ValidationError
from typing import Optional

client = OpenAI()

# --- INPUT GUARDS ---

# PII patterns
PII_PATTERNS = {
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "credit_card": r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b",
    "email": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b",
    "phone": r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b",
}

def check_pii(text: str) -> dict:
    """Detect PII in text."""
    found = {}
    for pii_type, pattern in PII_PATTERNS.items():
        matches = re.findall(pattern, text)
        if matches:
            found[pii_type] = len(matches)
    return {"has_pii": bool(found), "types": found}

def redact_pii(text: str) -> str:
    """Replace PII with redaction markers."""
    for pii_type, pattern in PII_PATTERNS.items():
        text = re.sub(pattern, f"[REDACTED_{pii_type.upper()}]", text)
    return text

# Injection detection (simplified — see adversarial-ml note for full version)
INJECTION_PATTERNS = [
    r"ignore (all |previous )?instructions",
    r"you are now",
    r"system prompt",
    r"forget everything",
]

def check_injection(text: str) -> bool:
    """Check for prompt injection attempts."""
    return any(re.search(p, text, re.IGNORECASE) for p in INJECTION_PATTERNS)

# --- OUTPUT GUARDS ---

def check_toxicity(text: str) -> dict:
    """Use OpenAI Moderation API to check for toxic content."""
    response = client.moderations.create(input=text)
    result = response.results[0]
    return {
        "flagged": result.flagged,
        "categories": {k: v for k, v in result.categories.model_dump().items() if v},
    }

# --- FULL PIPELINE ---

class GuardrailResult(BaseModel):
    allowed: bool
    response: Optional[str] = None
    blocked_reason: Optional[str] = None
    pii_redacted: bool = False
    model_used: str = ""

def guarded_completion(user_input: str, system_prompt: str) -> GuardrailResult:
    """Complete LLM request with full guardrail pipeline."""

    # 1. INPUT: Check injection
    if check_injection(user_input):
        return GuardrailResult(
            allowed=False,
            blocked_reason="Potential prompt injection detected",
        )

    # 2. INPUT: Check and redact PII
    pii_check = check_pii(user_input)
    clean_input = redact_pii(user_input) if pii_check["has_pii"] else user_input

    # 3. MODEL: Generate with constraints
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": clean_input},
        ],
        temperature=0.3,
        max_tokens=500,
    )
    output = response.choices[0].message.content

    # 4. OUTPUT: Check toxicity
    toxicity = check_toxicity(output)
    if toxicity["flagged"]:
        return GuardrailResult(
            allowed=False,
            blocked_reason=f"Output flagged for: {list(toxicity['categories'].keys())}",
        )

    # 5. OUTPUT: Check for PII leakage
    output_pii = check_pii(output)
    if output_pii["has_pii"]:
        output = redact_pii(output)

    return GuardrailResult(
        allowed=True,
        response=output,
        pii_redacted=pii_check["has_pii"] or output_pii["has_pii"],
        model_used="gpt-4o-mini",
    )

# Test
result = guarded_completion(
    "My SSN is 123-45-6789. Can you help me file taxes?",
    "You are a helpful tax assistant. Never repeat personal information."
)
print(result.model_dump_json(indent=2))
# Expected: PII redacted, response generated without SSN
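
Tool call validation (Layer 2 in the architecture above) is not shown in the pipeline. A minimal allowlist sketch; the tool names and parameter sets are hypothetical and should be replaced with your own tool registry.

# Allowlist of permitted tools and the parameters each may receive (names are illustrative)
ALLOWED_TOOLS = {
    "search_kb": {"query", "max_results"},
    "create_ticket": {"title", "priority"},
}

def validate_tool_call(name: str, arguments: dict) -> bool:
    """Reject calls to unknown tools or calls carrying unexpected parameters."""
    if name not in ALLOWED_TOOLS:
        return False
    return set(arguments) <= ALLOWED_TOOLS[name]

# Run before executing any model-proposed tool call
assert validate_tool_call("search_kb", {"query": "refund policy", "max_results": 3})
assert not validate_tool_call("delete_user", {"id": 42})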

◆ Quick Reference

GUARDRAIL PRIORITY (implement in this order):

  1. Prompt injection detection    — prevents control hijacking
  2. PII detection & redaction     — prevents data leakage
  3. Output toxicity filtering     — prevents brand damage
  4. Schema validation             — prevents downstream errors
  5. Topic boundaries              — keeps agent on-task
  6. Hallucination checking        — prevents misinformation
  7. Cost guards                   — prevents budget blowout

LATENCY BUDGET:
  Input guards:   < 50ms (regex + classifier)
  Output guards:  < 100ms (moderation API + validation)
  Total overhead: < 150ms added to request
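
Network-bound guards dominate that budget, so independent checks should run concurrently rather than back to back. A minimal sketch using asyncio and AsyncOpenAI; the topic check is a hypothetical yes/no classifier, and the model choice is an assumption.

# Run independent API-backed guards concurrently (sketch)
import asyncio
from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def moderation_guard(text: str) -> bool:
    """True if the Moderation API flags the text."""
    resp = await async_client.moderations.create(input=text)
    return resp.results[0].flagged

async def topic_guard(text: str) -> bool:
    """True if the message looks off-topic (hypothetical yes/no classifier)."""
    resp = await async_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer only 'yes' or 'no': is this message about taxes?"},
            {"role": "user", "content": text},
        ],
        max_tokens=1,
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("no")

async def run_input_guards(text: str) -> dict:
    """Run both guards in parallel instead of sequentially."""
    flagged, off_topic = await asyncio.gather(moderation_guard(text), topic_guard(text))
    return {"flagged": flagged, "off_topic": off_topic}

# asyncio.run(run_input_guards("Can you help me file my taxes?"))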

◆ Production Failure Modes

Failure | Symptoms | Root Cause | Mitigation
--- | --- | --- | ---
Over-blocking | Legitimate users get blocked frequently | Guards too aggressive, high false positive rate | Tune thresholds, add human review for borderline cases
Guardrail bypass | Bad content gets through despite guards | Adversarial input that evades pattern matching | Layer multiple detection methods, adversarial testing
Latency bloat | 500ms+ added per request from guardrails | Too many synchronous guards, slow toxicity API | Parallelize guards, cache repeated checks (sketch below), use fast models
PII leakage | Model outputs user PII from context | PII in system prompt or retrieved context | Redact PII before model sees it, output PII scanning
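
Caching repeated checks (the latency bloat mitigation above) pays off when the same strings recur, e.g. canned greetings or templated inputs. A minimal sketch memoizing Moderation API results with functools.lru_cache, reusing the `client` defined in the pipeline above.

# Memoize moderation results so identical inputs hit the API only once (sketch)
from functools import lru_cache

@lru_cache(maxsize=10_000)
def cached_toxicity_flag(text: str) -> bool:
    return client.moderations.create(input=text).results[0].flagged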

○ Interview Angles

  • Q: Design a guardrail system for a healthcare chatbot.
  • A: Three-layer approach. Input: PII detection (redact SSN, DOB before model sees them), injection detection, and topic filter (reject non-health queries). Model: system prompt with strict medical disclaimer rules, temperature=0 for consistency, structured output for treatment recommendations. Output: medical claim classifier (flag unverified treatment claims), PII leakage check, mandatory disclaimer injection. I'd add a HIPAA compliance layer that logs all interactions without PII for audit. Latency budget: < 200ms total guardrail overhead. For high-risk responses (medication, diagnosis), add a human-review queue.

◆ Hands-On Exercises

Exercise 1: Build a Guardrailed Chatbot

Goal: Add input and output guards to a basic chatbot
Time: 45 minutes
Steps:
  1. Build a basic chatbot with the OpenAI API
  2. Add PII detection (regex-based) on input and output
  3. Add prompt injection detection (regex + LLM classifier)
  4. Add toxicity checking (OpenAI Moderation API)
  5. Test with 10 adversarial inputs — how many get caught?
Expected Output: Guardrailed chatbot with attack resistance log


★ Connections

Relationship | Topics
--- | ---
Builds on | Adversarial ML, OWASP Top 10, LLMOps
Leads to | Healthcare AI compliance, Financial AI regulation, Safe agent deployment
Compare with | Traditional input validation, WAF (Web Application Firewall)
Cross-domain | AppSec, Compliance, Content moderation, RegTech

Type | Resource | Why
--- | --- | ---
🔧 Hands-on | Guardrails AI | Open-source guardrails framework with validators
🔧 Hands-on | NeMo Guardrails (NVIDIA) | Programmable guardrails for LLM applications
🔧 Hands-on | Microsoft Presidio | PII detection and de-identification
📘 Book | "AI Engineering" by Chip Huyen (2025), Ch. 6 | Safety and guardrail patterns in production

★ Sources

  • Guardrails AI — https://www.guardrailsai.com/
  • NVIDIA NeMo Guardrails — https://github.com/NVIDIA/NeMo-Guardrails
  • Microsoft Presidio — https://microsoft.github.io/presidio/
  • OpenAI Moderation API — https://platform.openai.com/docs/guides/moderation
  • Adversarial ML & AI Security