Structured Outputs & Constrained Generation¶
✨ Bit: JSON Mode tells the model "give me valid JSON." Structured Outputs tells the model "give me this exact schema, or nothing." The difference is the difference between hoping and enforcing.
★ TL;DR¶
- What: Techniques that force LLMs to produce output conforming to a specific schema — guaranteed structurally valid
- Why: Production data pipelines, tool calling, and API integrations require deterministic structure, not free-form text
- Key point: Native constrained decoding (token masking) is the 2026 standard — 100% syntactically reliable, and cheaper end-to-end than prompt-based approaches because it eliminates parse-and-retry loops
★ Overview¶
Definition¶
Structured outputs are LLM generation modes that guarantee the model's response conforms to a pre-defined schema (JSON Schema, Pydantic model, Zod type). Constrained decoding is the underlying mechanism: at each token generation step, the model's probability distribution is masked so it physically cannot emit tokens that violate the schema.
Scope¶
This note covers the spectrum from basic JSON Mode through strict schema enforcement. For function calling and tool use (which shares the underlying mechanism but serves a different purpose), see Function Calling & Structured Output.
Significance¶
- Every production GenAI pipeline that feeds LLM output into downstream code needs structured output
- Understanding constrained decoding is essential for debugging schema failures
- Interview-critical: "How do you ensure LLM output is always valid JSON?" is a standard question
Prerequisites¶
★ Deep Dive¶
The Hierarchy of Output Control¶
| Method | Reliability | Speed | Use When |
|---|---|---|---|
| Prompt instructions ("reply in JSON") | ~70-80% | Baseline | Prototyping only |
| JSON Mode (`response_format: json_object`) | ~95% syntax | Fast | Legacy apps, no schema needed |
| Structured Outputs (strict schema) | 100% syntax | Fast | Production data extraction |
| Function Calling / Tools | 100% syntax | Fast | Agentic tool selection |
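The payload difference between JSON Mode and a strict schema is small but consequential. A sketch of the two OpenAI-style `response_format` values (the `user` schema here is a hypothetical example):

```python
import json

# JSON Mode: the provider only guarantees the output parses as JSON.
json_mode = {"type": "json_object"}

# Structured Outputs: the provider enforces this exact schema during decoding.
# Strict mode requires every property to appear in "required" and
# "additionalProperties": false on each object.
structured = {
    "type": "json_schema",
    "json_schema": {
        "name": "user",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer"},
            },
            "required": ["name", "age"],
            "additionalProperties": False,
        },
    },
}

# Both go in the same place in the request:
#   client.chat.completions.create(..., response_format=json_mode)  # or structured
print(json.dumps(structured, indent=2))
```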
How Constrained Decoding Works¶
The schema is first compiled into a finite state machine (FSM). The model then generates tokens one at a time. At each step:
- Compute next-token probabilities as normal
- Based on the current FSM state, determine which tokens are valid next
- Mask all invalid tokens to probability zero
- Sample from the remaining valid tokens
- Advance the FSM according to the sampled token
This means the model literally cannot produce output that violates the schema. It's not post-processing or retry — it's enforced during generation.
Schema: {"name": string, "age": integer}
Token 1: "{" ✓ (must start with {)
Token 2: '"name"' ✓ (required field)
Token 3: ":" ✓ (key-value separator)
Token 4: '"Alice"' ✓ (string value expected)
Token 5: "," ✓ (more fields needed)
Token 6: '"age"' ✓ (required field)
Token 7: ":" ✓ (separator)
Token 8: "30" ✓ (integer expected)
Token 9: "}" ✓ (all fields present)
Counterexamples (masked at the corresponding step):
Token 8: '"thirty"' ✗ MASKED — integer required
Token 5: "}" ✗ MASKED — "age" still required
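The masking loop can be sketched in a few lines of Python. This toy hand-writes the FSM over a nine-token vocabulary for the `{"name", "age"}` schema above; a real engine such as Outlines compiles the schema into an FSM over the model's full tokenizer, and the logits here are fake stand-ins:

```python
import math

# Toy vocabulary and a hand-written FSM for {"name": string, "age": integer}
# (fixed field order for simplicity).
VOCAB = ['{', '"name"', ':', '"Alice"', ',', '"age"', '30', '"thirty"', '}']

# state -> {legal next token: next state}
FSM = {
    "start":      {'{': "need_name"},
    "need_name":  {'"name"': "name_colon"},
    "name_colon": {':': "name_value"},
    "name_value": {'"Alice"': "after_name"},
    "after_name": {',': "need_age"},     # "}" is illegal: "age" still required
    "need_age":   {'"age"': "age_colon"},
    "age_colon":  {':': "age_value"},
    "age_value":  {'30': "after_age"},   # '"thirty"' is illegal: integer required
    "after_age":  {'}': "done"},
}

def constrained_step(logits, state):
    """Mask illegal tokens to -inf, then greedily pick the best legal token."""
    legal = FSM[state]
    masked = [lp if tok in legal else -math.inf for tok, lp in zip(VOCAB, logits)]
    best = max(range(len(VOCAB)), key=lambda i: masked[i])
    token = VOCAB[best]
    return token, legal[token]

state, tokens = "start", []
while state != "done":
    fake_logits = [0.0] * len(VOCAB)
    # Pretend the model strongly prefers the invalid '"thirty"' everywhere:
    fake_logits[VOCAB.index('"thirty"')] = 5.0
    token, state = constrained_step(fake_logits, state)
    tokens.append(token)

print("".join(tokens))  # {"name":"Alice","age":30}
```

Even though the raw logits favor `"thirty"`, the mask makes emitting it impossible; the output is schema-valid by construction.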
Provider Comparison (April 2026)¶
| Provider | Feature | Schema Format | Key Strength | Key Limitation |
|---|---|---|---|---|
| OpenAI | `response_format: { type: "json_schema", json_schema: {...} }` | JSON Schema | Most mature, widest schema support | Max ~5 levels nesting |
| Anthropic | Tool-based extraction (define a tool for the schema) | Tool input schema | Excellent reasoning during extraction | No native response_format — uses tool workaround |
| Google | `response_schema` in `generation_config` | JSON Schema | Integrated with Vertex AI pipelines | Some types unsupported |
Schema Design Best Practices¶
- Field ordering matters for CoT: Place `reasoning` or `explanation` fields before `answer` or `result` fields. The model generates sequentially — reasoning first produces better conclusions.
- Flatten deep nesting: Schemas deeper than 3 levels reduce model accuracy. Break complex structures into pipeline stages.
- Use `description` as guidance: Each field's `description` in the schema acts as implicit instructions to the model.
- Enum constraints: For categorical fields, use `enum` instead of `string` — prevents hallucinated categories.
- Nullable fields: Use `nullable: true` for optional data instead of making fields required with defaults.
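Several of these practices can be seen together in one schema. A hypothetical support-ticket extraction schema (note that the spelling of nullable varies by provider: OpenAI strict mode uses a type union with `"null"`, as below, while Gemini's `response_schema` uses `nullable: true`):

```python
import json

# Hypothetical "support ticket" schema applying the practices above:
# reasoning before answer, enum for categories, descriptions as guidance,
# a nullable optional field, and a flat (one-level) structure.
ticket_schema = {
    "type": "object",
    "properties": {
        "reasoning": {  # generated first: the model reasons before it classifies
            "type": "string",
            "description": "Brief step-by-step analysis of the ticket.",
        },
        "category": {
            "type": "string",
            "enum": ["billing", "bug", "feature_request", "other"],
            "description": "Ticket category; one of the listed values only.",
        },
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        "duplicate_of": {
            # OpenAI strict mode spells nullable as a type union;
            # Gemini's response_schema uses "nullable": true instead.
            "type": ["string", "null"],
            "description": "ID of a duplicate ticket, or null if none.",
        },
    },
    "required": ["reasoning", "category", "priority", "duplicate_of"],
    "additionalProperties": False,
}

# dict insertion order is preserved, so "reasoning" is emitted first
print(json.dumps(list(ticket_schema["properties"])))
# ["reasoning", "category", "priority", "duplicate_of"]
```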
Semantic Validation: Beyond Syntax¶
Structured outputs guarantee syntactic correctness (valid JSON, correct types), but NOT semantic correctness:
| Syntactically Valid | Semantically Wrong |
|---|---|
| `{"price": -500.00}` | Prices shouldn't be negative |
| `{"start_date": "2026-12-31", "end_date": "2026-01-01"}` | End before start |
| `{"sentiment": "positive"}` for a negative review | Wrong classification |
Always add an application-level validation layer using Pydantic (Python) or Zod (TypeScript).
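A minimal sketch of such a layer in plain Python; `validate_booking` and its fields are hypothetical, and in practice these checks live in Pydantic `model_validator`s or Zod `refine` calls:

```python
from datetime import date

def validate_booking(data: dict) -> list[str]:
    """Cross-field semantic checks that JSON Schema alone cannot express."""
    errors = []
    if data["price"] < 0:
        errors.append(f"price must be non-negative, got {data['price']}")
    start = date.fromisoformat(data["start_date"])
    end = date.fromisoformat(data["end_date"])
    if end < start:
        errors.append("end_date precedes start_date")
    return errors

# Syntactically valid output that fails every semantic check:
bad = {"price": -500.00, "start_date": "2026-12-31", "end_date": "2026-01-01"}
print(validate_booking(bad))
# ['price must be non-negative, got -500.0', 'end_date precedes start_date']
```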
Self-Hosted Constrained Generation¶
For local/open-weights models:
| Tool | How It Works | Best For |
|---|---|---|
| Outlines | Grammar-based token masking via FSM | Any HuggingFace model, regex/JSON constraints |
| llguidance | Microsoft's constrained generation engine | Azure-hosted models, complex grammars |
| vLLM | `--guided-decoding-backend outlines` flag | Production serving with schema enforcement |
| llama.cpp | GBNF grammar support | Local inference on consumer hardware |
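For llama.cpp, the `{"name", "age"}` schema used earlier can be expressed as a GBNF grammar. This is a hand-written sketch; llama.cpp also ships a `json_schema_to_grammar.py` script that generates grammars from JSON Schema automatically (treat the exact script name as an assumption for your checkout):

```
root    ::= "{" ws "\"name\"" ws ":" ws string ws "," ws "\"age\"" ws ":" ws integer ws "}"
string  ::= "\"" [a-zA-Z0-9 ]* "\""
integer ::= "-"? [0-9]+
ws      ::= [ \t\n]*
```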
Portability Libraries¶
| Library | Approach | Supports |
|---|---|---|
| Instructor | Pydantic-first wrapper, automatic retry on validation failure | OpenAI, Anthropic, Google, Ollama, LiteLLM |
| BAML | Schema-first DSL with built-in retry, validation, and type generation | Multi-provider, TypeScript + Python |
◆ Quick Reference¶
| Problem | Solution |
|---|---|
| Need valid JSON from LLM | Use response_format with strict schema (not just JSON Mode) |
| Schema too complex, model struggles | Flatten nesting, split into pipeline stages |
| Values are valid types but semantically wrong | Add Pydantic/Zod validators, cross-field checks |
| Need same extraction across multiple providers | Use Instructor or BAML for portability |
| Running local model, need structured output | Use Outlines or vLLM's guided decoding |
○ Gotchas & Common Mistakes¶
- JSON Mode is NOT Structured Outputs — JSON Mode only guarantees valid JSON syntax, not your schema
- Model refusals bypass schema enforcement — always check for refusal metadata in the response
- Deep nesting (5+ levels) degrades output quality even with constrained decoding
- Structured outputs don't help with semantic correctness — a model can confidently fill every field with plausible but wrong values
- Strict schemas add overhead: the schema itself consumes input tokens, and the first request with a new schema incurs extra latency while the provider compiles it into a grammar
○ Interview Angles¶
- Q: What's the difference between JSON Mode and Structured Outputs?
  A: JSON Mode only guarantees the output is syntactically valid JSON — it could be any shape. Structured Outputs enforce a specific JSON Schema using constrained decoding, guaranteeing the output has exactly the right fields, types, and structure. In production, always use Structured Outputs because you need to parse the result programmatically.
- Q: How does constrained decoding work under the hood?
  A: The JSON Schema is converted into a finite state machine. At each token generation step, the FSM determines which tokens are legal given the current state. All illegal tokens are masked to zero probability. The model samples only from valid tokens. This means schema violations are mathematically impossible — it's not retry-based, it's enforced during generation.
- Q: Does structured output guarantee correct answers?
  A: No — it guarantees correct structure, not correct content. A model can output `{"sentiment": "positive"}` for a clearly negative review. Structured output is a formatting guarantee, not a factuality guarantee. You still need semantic validation, ground-truth checks, and domain-specific validators.
★ Code & Implementation¶
OpenAI Structured Output with Pydantic Validation¶
# pip install openai>=1.60 pydantic>=2
# ⚠️ Last tested: 2026-04 | Requires: openai>=1.60, pydantic>=2, OPENAI_API_KEY
import json

from openai import OpenAI
from pydantic import BaseModel, ConfigDict, field_validator

client = OpenAI()

class ProductReview(BaseModel):
    """Structured extraction target for product reviews."""

    # extra="forbid" makes Pydantic emit "additionalProperties": false,
    # which OpenAI strict mode requires on every object in the schema.
    model_config = ConfigDict(extra="forbid")

    product_name: str
    sentiment: str  # positive, negative, neutral
    rating: float  # 1.0 to 5.0
    key_points: list[str]
    recommendation: bool

    @field_validator("rating")
    @classmethod
    def validate_rating(cls, v: float) -> float:
        if not 1.0 <= v <= 5.0:
            raise ValueError(f"Rating must be 1.0-5.0, got {v}")
        return v

    @field_validator("sentiment")
    @classmethod
    def validate_sentiment(cls, v: str) -> str:
        allowed = {"positive", "negative", "neutral"}
        if v.lower() not in allowed:
            raise ValueError(f"Sentiment must be one of {allowed}")
        return v.lower()

schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "product_review",
        "strict": True,
        "schema": ProductReview.model_json_schema(),
    },
}

review_text = """Battery life is incredible — easily lasts 2 days.
Camera is decent but struggles in low light. Build quality feels premium.
Overpriced compared to competitors though."""

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract a structured review from the text."},
        {"role": "user", "content": review_text},
    ],
    response_format=schema,
    temperature=0,
)

# Parse and validate with Pydantic (catches semantic issues)
raw = json.loads(resp.choices[0].message.content)
review = ProductReview(**raw)

print(f"Product: {review.product_name}")
print(f"Sentiment: {review.sentiment} | Rating: {review.rating}/5")
print(f"Key points: {review.key_points}")
print(f"Recommends: {review.recommendation}")

# Expected output:
# Product: [phone/device name]
# Sentiment: positive | Rating: 3.5/5
# Key points: ['Great battery life', 'Decent camera', 'Premium build', 'Overpriced']
# Recommends: True
Anthropic Tool-Based Structured Extraction¶
# pip install anthropic>=0.40
# ⚠️ Last tested: 2026-04 | Requires: anthropic>=0.40, ANTHROPIC_API_KEY
import anthropic

client = anthropic.Anthropic()

# Anthropic uses tool definitions as the structured output mechanism
extract_tool = {
    "name": "extract_review",
    "description": "Extract structured review data from text",
    "input_schema": {
        "type": "object",
        "properties": {
            "product_name": {"type": "string"},
            "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
            "rating": {"type": "number", "minimum": 1, "maximum": 5},
            "key_points": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["product_name", "sentiment", "rating", "key_points"],
    },
}

# review_text: the same sample review defined in the OpenAI example above
resp = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=500,
    tools=[extract_tool],
    tool_choice={"type": "tool", "name": "extract_review"},  # Force tool use
    messages=[{"role": "user", "content": f"Extract review data: {review_text}"}],
)

# The structured data is in the tool use block
for block in resp.content:
    if block.type == "tool_use":
        print(f"Extracted: {block.input}")

# Expected output: {"product_name": "...", "sentiment": "positive", "rating": 3.5, ...}
Outlines: Constrained Generation for Local Models¶
# pip install outlines transformers torch
# ⚠️ Last tested: 2026-04 | Requires: outlines>=0.1, transformers>=4.48, GPU recommended
# Note: For API-based models, use outlines.from_openai(client, "model") instead
import outlines
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
# Define schema as a JSON Schema or Pydantic model
schema = '''{
"type": "object",
"properties": {
"name": {"type": "string"},
"sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
"score": {"type": "integer", "minimum": 1, "maximum": 10}
},
"required": ["name", "sentiment", "score"]
}'''
generator = outlines.generate.json(model, schema)
result = generator("Analyze this review: Great product, fast shipping, would buy again!")
print(result)
# Expected output: {"name": "...", "sentiment": "positive", "score": 9}
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Schema too complex | Model accuracy drops, timeouts, truncated output | >5 nesting levels, massive schemas | Flatten schema, split into pipeline stages, reduce required fields |
| Semantic garbage | Valid JSON with nonsensical values | No business logic validation layer | Pydantic validators, post-processing checks, cross-field validation |
| Model refusal | Empty or refusal response instead of structured data | Content policy triggered by input text | Check refusal metadata, handle gracefully, fallback to less restrictive prompt |
| Provider portability failure | Works on OpenAI, breaks on Anthropic | Different schema support levels, different tool patterns | Use Instructor/BAML for portability, test across providers |
| Enum hallucination | Values outside defined enum set in older models | Model generates plausible-but-invalid category | Stricter schema with explicit enum, validation layer, model upgrade |
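The "semantic garbage" and retry mitigations above combine into a simple loop, which is essentially what Instructor automates. A sketch with a stand-in `call_llm` (all names here are hypothetical):

```python
import json

def extract_with_retry(prompt, call_llm, validate, max_attempts=3):
    """Instructor-style loop: re-prompt with the validation errors on failure.
    call_llm stands in for any structured-output API call returning a JSON string."""
    for _ in range(max_attempts):
        data = json.loads(call_llm(prompt))  # syntax is the provider's guarantee
        errors = validate(data)              # semantics are our job
        if not errors:
            return data
        prompt += f"\nYour previous answer failed validation: {errors}. Fix it."
    raise ValueError(f"no valid output after {max_attempts} attempts")

# Demo with a fake model that corrects itself once it sees the error
replies = iter(['{"total": -1}', '{"total": 42}'])
result = extract_with_retry(
    "Extract the invoice total as JSON.",
    call_llm=lambda _prompt: next(replies),
    validate=lambda d: [] if d["total"] >= 0 else ["total must be >= 0"],
)
print(result)  # {'total': 42}
```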
◆ Hands-On Exercises¶
Exercise 1: Cross-Provider Extraction Comparison¶
Goal: Compare structured output reliability across 3 providers. Time: 30 minutes.
Steps:
1. Define a Pydantic model: JobPosting(title, company, salary_range, required_skills, remote_policy)
2. Collect 10 job posting texts from any job board
3. Extract structured data using OpenAI Structured Outputs, Anthropic tool-based extraction, and Google response_schema
4. Score: schema compliance rate, semantic accuracy (manual check), latency
Expected Output: Comparison table showing compliance % and quality per provider
Exercise 2: Build a Validated Extraction Pipeline¶
Goal: Build a production-grade extraction pipeline with semantic validation. Time: 45 minutes.
Steps:
1. Define a schema for invoice extraction: Invoice(vendor, date, line_items, total, currency)
2. Add Pydantic validators: total must equal sum of line items, date must be valid, currency must be ISO 4217
3. Implement retry logic: if validation fails, re-prompt with the error message
4. Test on 5 sample invoices (create mock text)
Expected Output: Pipeline that achieves 100% schema compliance and 90%+ semantic accuracy
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | Function Calling & Structured Output, Prompt Engineering |
| Leads to | Reliable data extraction pipelines, agentic tool use, automated data processing |
| Compare with | Regex parsing, traditional NLP extraction, template-based generation |
| Cross-domain | Data engineering, API design, schema management |
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 📄 Docs | OpenAI Structured Outputs Guide | The definitive guide to schema-enforced generation |
| 🔧 Hands-on | Instructor Library | Best tool for multi-provider structured output with Pydantic |
| 📄 Paper | Willard & Louf — "Efficient Guided Generation" (Outlines, 2023) | The paper that formalized constrained decoding via FSMs |
| 🔧 Hands-on | BAML | Schema-first structured output with built-in validation and multi-provider support |
★ Sources¶
- OpenAI Structured Outputs documentation — https://platform.openai.com/docs/guides/structured-outputs
- Anthropic Tool Use documentation — https://docs.anthropic.com/en/docs/build-with-claude/tool-use
- Outlines library — https://github.com/outlines-dev/outlines
- Instructor library — https://python.useinstructor.com/