API Design for AI Applications¶

AI APIs are not just "POST text, get text." They need to handle latency, streaming, structured outputs, cost, retries, and sometimes long-running workflows.

★ TL;DR¶

What: The design patterns for building application-facing APIs around AI systems.
Why: Poor API design leaks model quirks, makes clients brittle, and turns AI product iteration into integration pain.
Key point: Design APIs around product tasks and operational constraints, not around raw model endpoints alone.

★ Overview¶

Definition¶

An AI application API is the contract between clients and an AI-backed service. It defines request shape, response shape, streaming behavior, error semantics, and operational guarantees.

Scope¶

This note focuses on product-facing APIs, not provider SDK specifics. It covers synchronous and asynchronous patterns, structured outputs, feedback capture, and versioning.

Significance¶

Good APIs hide model churn from clients.
Streaming, idempotency, and traceability matter more in AI apps than many teams expect.
This note is especially useful for AI engineer and integration roles.

Prerequisites¶

★ Deep Dive¶

Core Design Questions¶

Ask:

Is this request interactive or long-running?
Does the client need plain text, structured JSON, or streaming tokens?
What errors are retryable?
How will versioning work when prompts or models change?
How will clients attach user identity, tenancy, or trace metadata?

Common AI API Patterns¶

Pattern	Best For	Example
Sync request/response	short tasks	rewrite text, classify, extract fields
Streaming response	chat and copilots	token stream over SSE or WebSocket
Async job API	long-running workflows	large document summarization
Webhook callback	background completion	batch generation pipeline
Session API	conversational systems	multi-turn chat state

Good Request Design¶

Separate:

task input from runtime config
user data from control parameters
schema expectations from free-form instructions

Example:

{
  "input": "Extract invoice fields from this text...",
  "response_format": "json",
  "metadata": {
    "tenant_id": "acme",
    "trace_id": "trace_123"
  }
}

Response Design¶

Useful response fields often include:

primary output
citations or evidence when relevant
finish reason
trace or request id
usage metadata when clients need budgeting

Streaming vs Async¶

Choice	Better When
Streaming	the user is waiting interactively
Async job	the job is long, expensive, or multi-stage

Error Design¶

Make these explicit:

validation error
rate limit
upstream model unavailable
policy refusal
timeout
partial failure

Do not collapse all AI problems into 500.

Versioning Strategy¶

Version:

public API contract
output schema
major behavioral modes when necessary

Do not force clients to track every prompt revision.

API Design Heuristics¶

Keep provider-specific details behind the service boundary.
Return stable structured outputs when downstream code depends on them.
Add trace ids for debugging.
Design for partial failure and retry.
Make feedback collection easy.

◆ Quick Reference¶

Need	Better Design Choice
interactive chat	streaming endpoint
10-minute document job	async job + polling or webhook
downstream automation	schema-first JSON response
model/provider churn	stable service contract over model-specific payloads
support debugging	trace ids and usage metadata

○ Gotchas & Common Mistakes¶

Raw provider payload passthrough creates long-term client pain.
Streaming is not always better if the task is short and structured.
Hidden prompt changes can look like API regressions to downstream teams.
Weak error semantics make retry storms more likely.

○ Interview Angles¶

Q: What should an AI API return besides the answer?
A: Usually a request id, status or finish reason, and optionally citations or usage metadata depending on the product. Those fields make debugging, billing, and trust much easier.
Q: When would you choose an async job API?
A: When the workflow is too long or variable for an interactive request, such as large document pipelines, multi-step agent tasks, or offline generation jobs.

★ Connections¶

Relationship	Topics
Builds on	AI System Design for GenAI Applications, Model Serving for LLM Applications
Leads to	integration engineering, conversational APIs, agent platforms
Compare with	provider-native APIs, traditional CRUD APIs
Cross-domain	backend engineering, API governance, DX

★ Code & Implementation¶

FastAPI AI Endpoint with Streaming¶

# pip install fastapi>=0.110 uvicorn>=0.27 openai>=1.0
# ⚠️ Last tested: 2026-04 | Requires: fastapi>=0.110

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI
import json, uuid, time

app = FastAPI()
client = OpenAI()

@app.post("/v1/summarize")
async def summarize(request: dict):
    """AI summarization endpoint with structured response."""
    trace_id = str(uuid.uuid4())
    start = time.time()

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize: {request['text']}"}],
        max_tokens=200,
    )

    return {
        "summary": response.choices[0].message.content,
        "finish_reason": response.choices[0].finish_reason,
        "usage": {"input_tokens": response.usage.prompt_tokens,
                  "output_tokens": response.usage.completion_tokens},
        "trace_id": trace_id,
        "latency_ms": round((time.time() - start) * 1000),
    }

@app.post("/v1/chat/stream")
async def chat_stream(request: dict):
    """Streaming chat endpoint via SSE."""
    def generate():
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=request["messages"],
            stream=True,
        )
        for chunk in stream:
            if chunk.choices[0].delta.content:
                yield f"data: {json.dumps({'content': chunk.choices[0].delta.content})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")
# Expected: POST /v1/summarize returns structured JSON with trace_id
# POST /v1/chat/stream returns SSE token stream

◆ Production Failure Modes¶

Failure	Symptoms	Root Cause	Mitigation
Provider passthrough leak	Client breaks when model changes	Raw provider response exposed to clients	Wrap in stable response schema, version the contract
Retry storm	10× traffic spike after transient failure	Client retries without backoff, server returns 500 for rate limits	Use 429 with Retry-After header, implement client-side exponential backoff
Streaming disconnect	User sees partial response, no error	Long SSE stream interrupted by proxy/LB timeout	Heartbeat pings, configurable timeouts, client reconnection logic
Schema drift	Downstream automation breaks silently	Prompt change alters output structure	Use structured output / JSON schema enforcement, version schemas

◆ Hands-On Exercises¶

Exercise 1: Design an AI API Contract¶

Goal: Design a complete API contract for an AI-powered document extraction service Time: 30 minutes Steps: 1. Define request schema (input document, extraction fields, format preferences) 2. Define response schema (extracted fields, confidence scores, trace_id, usage) 3. Define error responses (validation, rate limit, policy refusal, timeout) 4. Add streaming endpoint for real-time extraction feedback 5. Document with OpenAPI spec Expected Output: Complete OpenAPI spec covering sync, async, and streaming patterns

★ Recommended Resources¶

Type	Resource	Why
📘 Book	"AI Engineering" by Chip Huyen (2025), Ch 8	API design patterns for AI-backed services
🔧 Hands-on	OpenAI API Reference	Gold standard for AI API design — study their schema, streaming, and error patterns
🔧 Hands-on	FastAPI Documentation	Best Python framework for building AI APIs with auto-documentation
📄 Paper	Google API Design Guide	Industry-standard API design principles applicable to AI services

★ Sources¶

OpenAPI Specification — https://spec.openapis.org/
Google API Design Guide — https://cloud.google.com/apis/design
OpenAI API Reference — https://platform.openai.com/docs/api-reference
AI System Design