Guardrails & Content Filtering¶
✨ Bit: LLMs are powerful but unpredictable. Guardrails are the safety barriers that keep your AI from generating toxic content, leaking data, executing unauthorized actions, or hallucinating medical advice. They're not optional in production.
★ TL;DR¶
- What: Input validation, output filtering, and behavioral constraints applied to LLM systems to ensure safety, compliance, and quality
- Why: Without guardrails, LLMs can generate harmful content, leak PII, follow injection attacks, or produce outputs that violate regulations.
- Key point: Guardrails operate at 3 layers — input (block bad requests), model (constrain behavior), and output (validate before delivery) — and must be fast enough to not destroy latency.
★ Overview¶
Definition¶
Guardrails are the programmatic checks, filters, and constraints applied before, during, and after LLM inference to ensure outputs are safe, accurate, compliant, and on-topic.
Scope¶
Covers: Input validation, output filtering, PII detection, topic boundaries, hallucination guards, structured output enforcement, and production implementation. For adversarial attacks, see Adversarial ML. For security checklist, see OWASP Top 10.
Significance¶
- Regulatory requirement: EU AI Act, HIPAA, SOC2 all require content controls
- Brand safety: One toxic response can go viral and damage trust
- Production necessity: Every production LLM system needs guardrails — the question is which ones
Prerequisites¶
★ Deep Dive¶
Three-Layer Guardrail Architecture¶
USER INPUT
│
▼
┌──────────────────────────────────────────┐
│ LAYER 1: INPUT GUARDS │
│ │
│ • Prompt injection detection │
│ • PII detection & redaction │
│ • Topic boundary check │
│ • Input length / cost limits │
│ • Rate limiting │
│ │
│ ↓ BLOCKED → return rejection message │
│ ↓ PASSED → continue to model │
└──────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ LAYER 2: MODEL CONSTRAINTS │
│ │
│ • System prompt with behavioral rules │
│ • Temperature / token limits │
│ • Structured output enforcement │
│ • Tool call validation │
│ │
└──────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ LAYER 3: OUTPUT GUARDS │
│ │
│ • Toxicity / hate speech classifier │
│ • PII leakage detection │
│ • Hallucination check (if applicable) │
│ • Schema validation │
│ • Competitor mention filter │
│ • Citation verification │
│ │
│ ↓ FAILED → fallback response or retry │
│ ↓ PASSED → return to user │
└──────────────────────────────────────────┘
│
▼
USER RESPONSE
Guardrail Types¶
| Guardrail | Layer | What It Catches | Implementation |
|---|---|---|---|
| Prompt injection | Input | Attempts to override system instructions | Regex + classifier (see Adversarial ML) |
| PII detection | Input + Output | SSN, credit cards, emails, phone numbers | Regex + NER model (Presidio, spaCy) |
| Topic boundaries | Input | Off-topic requests (e.g., political opinions) | Classifier or system prompt enforcement |
| Toxicity filter | Output | Hate speech, violence, sexual content | OpenAI Moderation API, Perspective API |
| Hallucination guard | Output | Ungrounded claims, fabricated citations | Cross-reference with retrieved sources |
| Schema validation | Output | Malformed JSON, missing fields | Pydantic / JSON Schema validation |
| Cost guard | Input | Excessive token usage, prompt injection via length | Token counting + budget enforcement |
| Tool call validation | Output | Unauthorized tool calls, dangerous parameters | Allowlist of tools + parameter validation |
★ Code & Implementation¶
Production Guardrails Pipeline¶
# pip install openai>=1.0 pydantic>=2.0
# ⚠️ Last tested: 2026-04 | Requires: openai>=1.0
import re
from openai import OpenAI
from pydantic import BaseModel, ValidationError
from typing import Optional
client = OpenAI()
# --- INPUT GUARDS ---
# PII patterns
PII_PATTERNS = {
"ssn": r"\b\d{3}-\d{2}-\d{4}\b",
"credit_card": r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b",
"email": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b",
"phone": r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b",
}
def check_pii(text: str) -> dict:
"""Detect PII in text."""
found = {}
for pii_type, pattern in PII_PATTERNS.items():
matches = re.findall(pattern, text)
if matches:
found[pii_type] = len(matches)
return {"has_pii": bool(found), "types": found}
def redact_pii(text: str) -> str:
"""Replace PII with redaction markers."""
for pii_type, pattern in PII_PATTERNS.items():
text = re.sub(pattern, f"[REDACTED_{pii_type.upper()}]", text)
return text
# Injection detection (simplified — see adversarial-ml note for full version)
INJECTION_PATTERNS = [
r"ignore (all |previous )?instructions",
r"you are now",
r"system prompt",
r"forget everything",
]
def check_injection(text: str) -> bool:
"""Check for prompt injection attempts."""
return any(re.search(p, text, re.IGNORECASE) for p in INJECTION_PATTERNS)
# --- OUTPUT GUARDS ---
def check_toxicity(text: str) -> dict:
"""Use OpenAI Moderation API to check for toxic content."""
response = client.moderations.create(input=text)
result = response.results[0]
return {
"flagged": result.flagged,
"categories": {k: v for k, v in result.categories.model_dump().items() if v},
}
# --- FULL PIPELINE ---
class GuardrailResult(BaseModel):
allowed: bool
response: Optional[str] = None
blocked_reason: Optional[str] = None
pii_redacted: bool = False
model_used: str = ""
def guarded_completion(user_input: str, system_prompt: str) -> GuardrailResult:
"""Complete LLM request with full guardrail pipeline."""
# 1. INPUT: Check injection
if check_injection(user_input):
return GuardrailResult(
allowed=False,
blocked_reason="Potential prompt injection detected",
)
# 2. INPUT: Check and redact PII
pii_check = check_pii(user_input)
clean_input = redact_pii(user_input) if pii_check["has_pii"] else user_input
# 3. MODEL: Generate with constraints
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": clean_input},
],
temperature=0.3,
max_tokens=500,
)
output = response.choices[0].message.content
# 4. OUTPUT: Check toxicity
toxicity = check_toxicity(output)
if toxicity["flagged"]:
return GuardrailResult(
allowed=False,
blocked_reason=f"Output flagged for: {list(toxicity['categories'].keys())}",
)
# 5. OUTPUT: Check for PII leakage
output_pii = check_pii(output)
if output_pii["has_pii"]:
output = redact_pii(output)
return GuardrailResult(
allowed=True,
response=output,
pii_redacted=pii_check["has_pii"] or output_pii["has_pii"],
model_used="gpt-4o-mini",
)
# Test
result = guarded_completion(
"My SSN is 123-45-6789. Can you help me file taxes?",
"You are a helpful tax assistant. Never repeat personal information."
)
print(result.model_dump_json(indent=2))
# Expected: PII redacted, response generated without SSN
◆ Quick Reference¶
GUARDRAIL PRIORITY (implement in this order):
1. Prompt injection detection — prevents control hijacking
2. PII detection & redaction — prevents data leakage
3. Output toxicity filtering — prevents brand damage
4. Schema validation — prevents downstream errors
5. Topic boundaries — keeps agent on-task
6. Hallucination checking — prevents misinformation
7. Cost guards — prevents budget blowout
LATENCY BUDGET:
Input guards: < 50ms (regex + classifier)
Output guards: < 100ms (moderation API + validation)
Total overhead: < 150ms added to request
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Over-blocking | Legitimate users get blocked frequently | Guards too aggressive, high false positive rate | Tune thresholds, add human review for borderline cases |
| Guardrail bypass | Bad content gets through despite guards | Adversarial input that evades pattern matching | Layer multiple detection methods, adversarial testing |
| Latency bloat | 500ms+ added per request from guardrails | Too many synchronous guards, slow toxicity API | Parallelize guards, cache repeated checks, use fast models |
| PII leakage | Model outputs user PII from context | PII in system prompt or retrieved context | Redact PII before model sees it, output PII scanning |
○ Interview Angles¶
- Q: Design a guardrail system for a healthcare chatbot.
- A: Three-layer approach. Input: PII detection (redact SSN, DOB before model sees them), injection detection, and topic filter (reject non-health queries). Model: system prompt with strict medical disclaimer rules, temperature=0 for consistency, structured output for treatment recommendations. Output: medical claim classifier (flag unverified treatment claims), PII leakage check, mandatory disclaimer injection. I'd add a HIPAA compliance layer that logs all interactions without PII for audit. Latency budget: < 200ms total guardrail overhead. For high-risk responses (medication, diagnosis), add a human-review queue.
◆ Hands-On Exercises¶
Exercise 1: Build a Guardrailed Chatbot¶
Goal: Add input and output guards to a basic chatbot Time: 45 minutes Steps: 1. Build a basic chatbot with the OpenAI API 2. Add PII detection (regex-based) on input and output 3. Add prompt injection detection (regex + LLM classifier) 4. Add toxicity checking (OpenAI Moderation API) 5. Test with 10 adversarial inputs — how many get caught? Expected Output: Guardrailed chatbot with attack resistance log
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | Adversarial ML, OWASP Top 10, LLMOps |
| Leads to | Healthcare AI compliance, Financial AI regulation, Safe agent deployment |
| Compare with | Traditional input validation, WAF (Web Application Firewall) |
| Cross-domain | AppSec, Compliance, Content moderation, RegTech |
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 🔧 Hands-on | Guardrails AI | Open-source guardrails framework with validators |
| 🔧 Hands-on | NeMo Guardrails (NVIDIA) | Programmable guardrails for LLM applications |
| 🔧 Hands-on | Microsoft Presidio | PII detection and de-identification |
| 📘 Book | "AI Engineering" by Chip Huyen (2025), Ch 6 | Safety and guardrail patterns in production |
★ Sources¶
- Guardrails AI — https://www.guardrailsai.com/
- NVIDIA NeMo Guardrails — https://github.com/NVIDIA/NeMo-Guardrails
- Microsoft Presidio — https://microsoft.github.io/presidio/
- OpenAI Moderation API — https://platform.openai.com/docs/guides/moderation
- Adversarial ML & AI Security