Structured Outputs & Constrained Generation¶
✨ Bit: JSON Mode tells the model "give me valid JSON." Structured Outputs tells the model "give me this exact schema, or nothing." The difference is the difference between hoping and enforcing.
★ TL;DR¶
- What: Techniques that force LLMs to produce output conforming to a specific schema — guaranteed structurally valid
- Why: Production data pipelines, tool calling, and API integrations require deterministic structure, not free-form text
- Key point: Native constrained decoding (token masking) is the 2026 standard — 100% syntactically reliable, and cheaper end-to-end than prompt-based approaches because it eliminates parse-and-retry loops
★ Overview¶
Definition¶
Structured outputs are LLM generation modes that guarantee the model's response conforms to a pre-defined schema (JSON Schema, Pydantic model, Zod type). Constrained decoding is the underlying mechanism: at each token generation step, the model's probability distribution is masked so it physically cannot emit tokens that violate the schema.
Scope¶
This note covers the spectrum from basic JSON Mode through strict schema enforcement. For function calling and tool use (which shares the underlying mechanism but serves a different purpose), see Function Calling & Structured Output.
Significance¶
- Every production GenAI pipeline that feeds LLM output into downstream code needs structured output
- Understanding constrained decoding is essential for debugging schema failures
- Interview-critical: "How do you ensure LLM output is always valid JSON?" is a standard question
Prerequisites¶
★ Deep Dive¶
The Hierarchy of Output Control¶
| Method | Reliability | Speed | Use When |
|---|---|---|---|
| Prompt instructions ("reply in JSON") | ~70-80% | Baseline | Prototyping only |
| JSON Mode (`response_format: json_object`) | ~95% syntax | Fast | Legacy apps, no schema needed |
| Structured Outputs (strict schema) | 100% syntax | Fast | Production data extraction |
| Function Calling / Tools | 100% syntax | Fast | Agentic tool selection |
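The payload difference between JSON Mode and a strict schema is small but consequential. A sketch of the two OpenAI-style `response_format` values (the `user` schema here is a hypothetical example):

```python
import json

# JSON Mode: the provider only guarantees the output parses as JSON.
json_mode = {"type": "json_object"}

# Structured Outputs: the provider enforces this exact schema during decoding.
# Strict mode requires every property to appear in "required" and
# "additionalProperties": false on each object.
structured = {
    "type": "json_schema",
    "json_schema": {
        "name": "user",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer"},
            },
            "required": ["name", "age"],
            "additionalProperties": False,
        },
    },
}

# Both go in the same place in the request:
#   client.chat.completions.create(..., response_format=json_mode)  # or structured
print(json.dumps(structured, indent=2))
```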
How Constrained Decoding Works¶
The schema is first compiled into a finite state machine (FSM). The model then generates tokens one at a time. At each step:
- Compute next-token probabilities as normal
- Based on the current FSM state, determine which tokens are valid next
- Mask all invalid tokens to probability zero
- Sample from the remaining valid tokens
- Advance the FSM according to the sampled token
This means the model literally cannot produce output that violates the schema. It's not post-processing or retry — it's enforced during generation.
Schema: {"name": string, "age": integer}
Token 1: "{" ✓ (must start with {)
Token 2: '"name"' ✓ (required field)
Token 3: ":" ✓ (key-value separator)
Token 4: '"Alice"' ✓ (string value expected)
Token 5: "," ✓ (more fields needed)
Token 6: '"age"' ✓ (required field)
Token 7: ":" ✓ (separator)
Token 8: "30" ✓ (integer expected)
Token 9: "}" ✓ (all fields present)
Counterexamples (masked at the corresponding step):
Token 8: '"thirty"' ✗ MASKED — integer required
Token 5: "}" ✗ MASKED — "age" still required
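The masking loop can be sketched in a few lines of Python. This toy hand-writes the FSM over a nine-token vocabulary for the `{"name", "age"}` schema above; a real engine such as Outlines compiles the schema into an FSM over the model's full tokenizer, and the logits here are fake stand-ins:

```python
import math

# Toy vocabulary and a hand-written FSM for {"name": string, "age": integer}
# (fixed field order for simplicity).
VOCAB = ['{', '"name"', ':', '"Alice"', ',', '"age"', '30', '"thirty"', '}']

# state -> {legal next token: next state}
FSM = {
    "start":      {'{': "need_name"},
    "need_name":  {'"name"': "name_colon"},
    "name_colon": {':': "name_value"},
    "name_value": {'"Alice"': "after_name"},
    "after_name": {',': "need_age"},     # "}" is illegal: "age" still required
    "need_age":   {'"age"': "age_colon"},
    "age_colon":  {':': "age_value"},
    "age_value":  {'30': "after_age"},   # '"thirty"' is illegal: integer required
    "after_age":  {'}': "done"},
}

def constrained_step(logits, state):
    """Mask illegal tokens to -inf, then greedily pick the best legal token."""
    legal = FSM[state]
    masked = [lp if tok in legal else -math.inf for tok, lp in zip(VOCAB, logits)]
    best = max(range(len(VOCAB)), key=lambda i: masked[i])
    token = VOCAB[best]
    return token, legal[token]

state, tokens = "start", []
while state != "done":
    fake_logits = [0.0] * len(VOCAB)
    # Pretend the model strongly prefers the invalid '"thirty"' everywhere:
    fake_logits[VOCAB.index('"thirty"')] = 5.0
    token, state = constrained_step(fake_logits, state)
    tokens.append(token)

print("".join(tokens))  # {"name":"Alice","age":30}
```

Even though the raw logits favor `"thirty"`, the mask makes emitting it impossible; the output is schema-valid by construction.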
Provider Comparison (April 2026)¶
| Provider | Feature | Schema Format | Key Strength | Key Limitation |
|---|---|---|---|---|
| OpenAI | `response_format: { type: "json_schema", json_schema: {...} }` | JSON Schema | Most mature, widest schema support | Max ~5 levels nesting |
| Anthropic | Tool-based extraction (define a tool for the schema) | Tool input schema | Excellent reasoning during extraction | No native response_format — uses tool workaround |
| Google | `response_schema` in `generation_config` | JSON Schema | Integrated with Vertex AI pipelines | Some types unsupported |
Schema Design Best Practices¶
- Field ordering matters for CoT: Place `reasoning` or `explanation` fields before `answer` or `result` fields. The model generates sequentially — reasoning first produces better conclusions.
- Flatten deep nesting: Schemas deeper than 3 levels reduce model accuracy. Break complex structures into pipeline stages.
- Use `description` as guidance: Each field's `description` in the schema acts as implicit instructions to the model.
- Enum constraints: For categorical fields, use `enum` instead of `string` — prevents hallucinated categories.
- Nullable fields: Use `nullable: true` for optional data instead of making fields required with defaults.
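Several of these practices can be seen together in one schema. A hypothetical support-ticket extraction schema (note that the spelling of nullable varies by provider: OpenAI strict mode uses a type union with `"null"`, as below, while Gemini's `response_schema` uses `nullable: true`):

```python
import json

# Hypothetical "support ticket" schema applying the practices above:
# reasoning before answer, enum for categories, descriptions as guidance,
# a nullable optional field, and a flat (one-level) structure.
ticket_schema = {
    "type": "object",
    "properties": {
        "reasoning": {  # generated first: the model reasons before it classifies
            "type": "string",
            "description": "Brief step-by-step analysis of the ticket.",
        },
        "category": {
            "type": "string",
            "enum": ["billing", "bug", "feature_request", "other"],
            "description": "Ticket category; one of the listed values only.",
        },
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        "duplicate_of": {
            # OpenAI strict mode spells nullable as a type union;
            # Gemini's response_schema uses "nullable": true instead.
            "type": ["string", "null"],
            "description": "ID of a duplicate ticket, or null if none.",
        },
    },
    "required": ["reasoning", "category", "priority", "duplicate_of"],
    "additionalProperties": False,
}

# dict insertion order is preserved, so "reasoning" is emitted first
print(json.dumps(list(ticket_schema["properties"])))
# ["reasoning", "category", "priority", "duplicate_of"]
```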
Semantic Validation: Beyond Syntax¶
Structured outputs guarantee syntactic correctness (valid JSON, correct types), but NOT semantic correctness:
| Syntactically Valid | Semantically Wrong |
|---|---|
| `{"price": -500.00}` | Prices shouldn't be negative |
| `{"start_date": "2026-12-31", "end_date": "2026-01-01"}` | End before start |
| `{"sentiment": "positive"}` for a negative review | Wrong classification |
Always add an application-level validation layer using Pydantic (Python) or Zod (TypeScript).
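A minimal sketch of such a layer in plain Python; `validate_booking` and its fields are hypothetical, and in practice these checks live in Pydantic `model_validator`s or Zod `refine` calls:

```python
from datetime import date

def validate_booking(data: dict) -> list[str]:
    """Cross-field semantic checks that JSON Schema alone cannot express."""
    errors = []
    if data["price"] < 0:
        errors.append(f"price must be non-negative, got {data['price']}")
    start = date.fromisoformat(data["start_date"])
    end = date.fromisoformat(data["end_date"])
    if end < start:
        errors.append("end_date precedes start_date")
    return errors

# Syntactically valid output that fails every semantic check:
bad = {"price": -500.00, "start_date": "2026-12-31", "end_date": "2026-01-01"}
print(validate_booking(bad))
# ['price must be non-negative, got -500.0', 'end_date precedes start_date']
```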
Self-Hosted Constrained Generation¶
For local/open-weights models:
| Tool | How It Works | Best For |
|---|---|---|
| Outlines | Grammar-based token masking via FSM | Any HuggingFace model, regex/JSON constraints |
| llguidance | Microsoft's constrained generation engine | Azure-hosted models, complex grammars |
| vLLM | `--guided-decoding-backend outlines` flag | Production serving with schema enforcement |
| llama.cpp | GBNF grammar support | Local inference on consumer hardware |
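For llama.cpp, the `{"name", "age"}` schema used earlier can be expressed as a GBNF grammar. This is a hand-written sketch; llama.cpp also ships a `json_schema_to_grammar.py` script that generates grammars from JSON Schema automatically (treat the exact script name as an assumption for your checkout):

```
root    ::= "{" ws "\"name\"" ws ":" ws string ws "," ws "\"age\"" ws ":" ws integer ws "}"
string  ::= "\"" [a-zA-Z0-9 ]* "\""
integer ::= "-"? [0-9]+
ws      ::= [ \t\n]*
```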
Portability Libraries¶
| Library | Approach | Supports |
|---|---|---|
| Instructor | Pydantic-first wrapper, automatic retry on validation failure | OpenAI, Anthropic, Google, Ollama, LiteLLM |
| BAML | Schema-first DSL with built-in retry, validation, and type generation | Multi-provider, TypeScript + Python |
◆ Quick Reference¶
| Problem | Solution |
|---|---|
| Need valid JSON from LLM | Use response_format with strict schema (not just JSON Mode) |
| Schema too complex, model struggles | Flatten nesting, split into pipeline stages |
| Values are valid types but semantically wrong | Add Pydantic/Zod validators, cross-field checks |
| Need same extraction across multiple providers | Use Instructor or BAML for portability |
| Running local model, need structured output | Use Outlines or vLLM's guided decoding |
○ Gotchas & Common Mistakes¶
- JSON Mode is NOT Structured Outputs — JSON Mode only guarantees valid JSON syntax, not your schema
- Model refusals bypass schema enforcement — always check for refusal metadata in the response
- Deep nesting (5+ levels) degrades output quality even with constrained decoding
- Structured outputs don't help with semantic correctness — a model can confidently fill every field with plausible but wrong values
- Strict schemas add overhead: the schema itself consumes input tokens, and the first request with a new schema incurs extra latency while the provider compiles it into a grammar
○ Interview Angles¶
- Q: What's the difference between JSON Mode and Structured Outputs?
  A: JSON Mode only guarantees the output is syntactically valid JSON — it could be any shape. Structured Outputs enforce a specific JSON Schema using constrained decoding, guaranteeing the output has exactly the right fields, types, and structure. In production, always use Structured Outputs because you need to parse the result programmatically.
- Q: How does constrained decoding work under the hood?
  A: The JSON Schema is converted into a finite state machine. At each token generation step, the FSM determines which tokens are legal given the current state. All illegal tokens are masked to zero probability. The model samples only from valid tokens. This means schema violations are mathematically impossible — it's not retry-based, it's enforced during generation.
- Q: Does structured output guarantee correct answers?
  A: No — it guarantees correct structure, not correct content. A model can output `{"sentiment": "positive"}` for a clearly negative review. Structured output is a formatting guarantee, not a factuality guarantee. You still need semantic validation, ground-truth checks, and domain-specific validators.
★ Code & Implementation¶
OpenAI Structured Output with Pydantic Validation¶
# pip install openai>=1.60 pydantic>=2
# ⚠️ Last tested: 2026-04 | Requires: openai>=1.60, pydantic>=2, OPENAI_API_KEY
import json

from openai import OpenAI
from pydantic import BaseModel, ConfigDict, field_validator

client = OpenAI()

class ProductReview(BaseModel):
    """Structured extraction target for product reviews."""

    # extra="forbid" makes Pydantic emit "additionalProperties": false,
    # which OpenAI strict mode requires on every object in the schema.
    model_config = ConfigDict(extra="forbid")

    product_name: str
    sentiment: str  # positive, negative, neutral
    rating: float  # 1.0 to 5.0
    key_points: list[str]
    recommendation: bool

    @field_validator("rating")
    @classmethod
    def validate_rating(cls, v: float) -> float:
        if not 1.0 <= v <= 5.0:
            raise ValueError(f"Rating must be 1.0-5.0, got {v}")
        return v

    @field_validator("sentiment")
    @classmethod
    def validate_sentiment(cls, v: str) -> str:
        allowed = {"positive", "negative", "neutral"}
        if v.lower() not in allowed:
            raise ValueError(f"Sentiment must be one of {allowed}")
        return v.lower()

schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "product_review",
        "strict": True,
        "schema": ProductReview.model_json_schema(),
    },
}

review_text = """Battery life is incredible — easily lasts 2 days.
Camera is decent but struggles in low light. Build quality feels premium.
Overpriced compared to competitors though."""

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract a structured review from the text."},
        {"role": "user", "content": review_text},
    ],
    response_format=schema,
    temperature=0,
)

# Parse and validate with Pydantic (catches semantic issues)
raw = json.loads(resp.choices[0].message.content)
review = ProductReview(**raw)

print(f"Product: {review.product_name}")
print(f"Sentiment: {review.sentiment} | Rating: {review.rating}/5")
print(f"Key points: {review.key_points}")
print(f"Recommends: {review.recommendation}")

# Expected output:
# Product: [phone/device name]
# Sentiment: positive | Rating: 3.5/5
# Key points: ['Great battery life', 'Decent camera', 'Premium build', 'Overpriced']
# Recommends: True
Anthropic Tool-Based Structured Extraction¶
# pip install anthropic>=0.40
# ⚠️ Last tested: 2026-04 | Requires: anthropic>=0.40, ANTHROPIC_API_KEY
import anthropic

client = anthropic.Anthropic()

# Anthropic uses tool definitions as the structured output mechanism
extract_tool = {
    "name": "extract_review",
    "description": "Extract structured review data from text",
    "input_schema": {
        "type": "object",
        "properties": {
            "product_name": {"type": "string"},
            "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
            "rating": {"type": "number", "minimum": 1, "maximum": 5},
            "key_points": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["product_name", "sentiment", "rating", "key_points"],
    },
}

# review_text: the same sample review defined in the OpenAI example above
resp = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=500,
    tools=[extract_tool],
    tool_choice={"type": "tool", "name": "extract_review"},  # Force tool use
    messages=[{"role": "user", "content": f"Extract review data: {review_text}"}],
)

# The structured data is in the tool use block
for block in resp.content:
    if block.type == "tool_use":
        print(f"Extracted: {block.input}")

# Expected output: {"product_name": "...", "sentiment": "positive", "rating": 3.5, ...}
Outlines: Constrained Generation for Local Models¶
# pip install outlines transformers torch
# ⚠️ Last tested: 2026-04 | Requires: outlines>=0.1, transformers>=4.48, GPU recommended
# Note: For API-based models, use outlines.from_openai(client, "model") instead
import outlines
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
# Define schema as a JSON Schema or Pydantic model
schema = '''{
"type": "object",
"properties": {
"name": {"type": "string"},
"sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
"score": {"type": "integer", "minimum": 1, "maximum": 10}
},
"required": ["name", "sentiment", "score"]
}'''
generator = outlines.generate.json(model, schema)
result = generator("Analyze this review: Great product, fast shipping, would buy again!")
print(result)
# Expected output: {"name": "...", "sentiment": "positive", "score": 9}
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Schema too complex | Model accuracy drops, timeouts, truncated output | >5 nesting levels, massive schemas | Flatten schema, split into pipeline stages, reduce required fields |
| Semantic garbage | Valid JSON with nonsensical values | No business logic validation layer | Pydantic validators, post-processing checks, cross-field validation |
| Model refusal | Empty or refusal response instead of structured data | Content policy triggered by input text | Check refusal metadata, handle gracefully, fallback to less restrictive prompt |
| Provider portability failure | Works on OpenAI, breaks on Anthropic | Different schema support levels, different tool patterns | Use Instructor/BAML for portability, test across providers |
| Enum hallucination | Values outside defined enum set in older models | Model generates plausible-but-invalid category | Stricter schema with explicit enum, validation layer, model upgrade |
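The "semantic garbage" and retry mitigations above combine into a simple loop, which is essentially what Instructor automates. A sketch with a stand-in `call_llm` (all names here are hypothetical):

```python
import json

def extract_with_retry(prompt, call_llm, validate, max_attempts=3):
    """Instructor-style loop: re-prompt with the validation errors on failure.
    call_llm stands in for any structured-output API call returning a JSON string."""
    for _ in range(max_attempts):
        data = json.loads(call_llm(prompt))  # syntax is the provider's guarantee
        errors = validate(data)              # semantics are our job
        if not errors:
            return data
        prompt += f"\nYour previous answer failed validation: {errors}. Fix it."
    raise ValueError(f"no valid output after {max_attempts} attempts")

# Demo with a fake model that corrects itself once it sees the error
replies = iter(['{"total": -1}', '{"total": 42}'])
result = extract_with_retry(
    "Extract the invoice total as JSON.",
    call_llm=lambda _prompt: next(replies),
    validate=lambda d: [] if d["total"] >= 0 else ["total must be >= 0"],
)
print(result)  # {'total': 42}
```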
◆ Hands-On Exercises¶
Exercise 1: Cross-Provider Extraction Comparison¶
Goal: Compare structured output reliability across 3 providers. Time: 30 minutes.
Steps:
1. Define a Pydantic model: JobPosting(title, company, salary_range, required_skills, remote_policy)
2. Collect 10 job posting texts from any job board
3. Extract structured data using OpenAI Structured Outputs, Anthropic tool-based extraction, and Google response_schema
4. Score: schema compliance rate, semantic accuracy (manual check), latency
Expected Output: Comparison table showing compliance % and quality per provider
Exercise 2: Build a Validated Extraction Pipeline¶
Goal: Build a production-grade extraction pipeline with semantic validation. Time: 45 minutes.
Steps:
1. Define a schema for invoice extraction: Invoice(vendor, date, line_items, total, currency)
2. Add Pydantic validators: total must equal sum of line items, date must be valid, currency must be ISO 4217
3. Implement retry logic: if validation fails, re-prompt with the error message
4. Test on 5 sample invoices (create mock text)
Expected Output: Pipeline that achieves 100% schema compliance and 90%+ semantic accuracy
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | Function Calling & Structured Output, Prompt Engineering |
| Leads to | Reliable data extraction pipelines, agentic tool use, automated data processing |
| Compare with | Regex parsing, traditional NLP extraction, template-based generation |
| Cross-domain | Data engineering, API design, schema management |
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 📄 Docs | OpenAI Structured Outputs Guide | The definitive guide to schema-enforced generation |
| 🔧 Hands-on | Instructor Library | Best tool for multi-provider structured output with Pydantic |
| 📄 Paper | Willard & Louf — "Efficient Guided Generation" (Outlines, 2023) | The paper that formalized constrained decoding via FSMs |
| 🔧 Hands-on | BAML | Schema-first structured output with built-in validation and multi-provider support |
★ Sources¶
- OpenAI Structured Outputs documentation — https://platform.openai.com/docs/guides/structured-outputs
- Anthropic Tool Use documentation — https://docs.anthropic.com/en/docs/build-with-claude/tool-use
- Outlines library — https://github.com/outlines-dev/outlines
- Instructor library — https://python.useinstructor.com/