API Design for AI Applications¶
AI APIs are not just "POST text, get text." They need to handle latency, streaming, structured outputs, cost, retries, and sometimes long-running workflows.
★ TL;DR¶
- What: The design patterns for building application-facing APIs around AI systems.
- Why: Poor API design leaks model quirks, makes clients brittle, and turns AI product iteration into integration pain.
- Key point: Design APIs around product tasks and operational constraints, not around raw model endpoints alone.
★ Overview¶
Definition¶
An AI application API is the contract between clients and an AI-backed service. It defines request shape, response shape, streaming behavior, error semantics, and operational guarantees.
Scope¶
This note focuses on product-facing APIs, not provider SDK specifics. It covers synchronous and asynchronous patterns, structured outputs, feedback capture, and versioning.
Significance¶
- Good APIs hide model churn from clients.
- Streaming, idempotency, and traceability matter more in AI apps than many teams expect.
- This note is especially useful for AI engineering and integration roles.
Prerequisites¶
- AI System Design for GenAI Applications
- Model Serving for LLM Applications
- Function Calling and Structured Output
★ Deep Dive¶
Core Design Questions¶
Ask:
- Is this request interactive or long-running?
- Does the client need plain text, structured JSON, or streaming tokens?
- What errors are retryable?
- How will versioning work when prompts or models change?
- How will clients attach user identity, tenancy, or trace metadata?
Common AI API Patterns¶
| Pattern | Best For | Example |
|---|---|---|
| Sync request/response | short tasks | rewrite text, classify, extract fields |
| Streaming response | chat and copilots | token stream over SSE or WebSocket |
| Async job API | long-running workflows | large document summarization |
| Webhook callback | background completion | batch generation pipeline |
| Session API | conversational systems | multi-turn chat state |
Good Request Design¶
Separate:
- task input from runtime config
- user data from control parameters
- schema expectations from free-form instructions
Example:
{
  "input": "Extract invoice fields from this text...",
  "response_format": "json",
  "metadata": {
    "tenant_id": "acme",
    "trace_id": "trace_123"
  }
}
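The separation above can be enforced at the service boundary. The following is a minimal sketch using only the standard library; the names `ExtractRequest`, `RequestMetadata`, `parse_request`, and the allowed format set are hypothetical and not part of any framework.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed set of formats for illustration; a real service defines its own.
ALLOWED_FORMATS = {"text", "json"}


@dataclass(frozen=True)
class RequestMetadata:
    """Identity and tracing fields, kept apart from the task input."""
    tenant_id: str
    trace_id: Optional[str] = None


@dataclass(frozen=True)
class ExtractRequest:
    """Task input, control parameters, and metadata as separate fields."""
    input: str
    response_format: str = "text"
    metadata: Optional[RequestMetadata] = None


def parse_request(payload: dict) -> ExtractRequest:
    """Validate a decoded JSON payload; raise ValueError on a bad shape."""
    text = payload.get("input")
    if not isinstance(text, str) or not text.strip():
        raise ValueError("'input' must be a non-empty string")

    fmt = payload.get("response_format", "text")
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"'response_format' must be one of {sorted(ALLOWED_FORMATS)}")

    metadata = None
    meta = payload.get("metadata")
    if meta is not None:
        if not isinstance(meta, dict) or not isinstance(meta.get("tenant_id"), str):
            raise ValueError("'metadata.tenant_id' must be a string")
        metadata = RequestMetadata(meta["tenant_id"], meta.get("trace_id"))

    return ExtractRequest(input=text, response_format=fmt, metadata=metadata)
```

Rejecting a malformed payload here, before any model call, lets the service return a validation error instead of spending tokens on a request that cannot succeed.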
Response Design¶
Useful response fields often include:
- primary output
- citations or evidence when relevant
- finish reason
- trace or request id
- usage metadata when clients need budgeting
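One way to carry these fields is a small response envelope. This is a sketch under assumptions: the class names `AIResponse` and `Usage` and the exact field names are illustrative, not a standard.

```python
import uuid
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Usage:
    input_tokens: int
    output_tokens: int


@dataclass
class AIResponse:
    output: str                     # primary output
    finish_reason: str              # e.g. "stop" or "length"
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    citations: List[str] = field(default_factory=list)
    usage: Optional[Usage] = None   # only when clients need budgeting

    def to_dict(self) -> dict:
        """Serialize to a plain dict, omitting usage when it was not recorded."""
        body = asdict(self)
        if body["usage"] is None:
            del body["usage"]
        return body
```

Because the envelope is owned by the service, a provider change only affects the code that fills it in, not the clients that read it.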
Streaming vs Async¶
| Choice | Better When |
|---|---|
| Streaming | the user is waiting interactively |
| Async job | the job is long, expensive, or multi-stage |
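For the async side, the core of the pattern is "submit returns a job id, the client polls for status." The sketch below keeps jobs in memory with a thread pool purely for illustration; `JobStore` is a hypothetical name, and a production service would more likely use a durable queue and database.

```python
import threading
import uuid
from concurrent.futures import ThreadPoolExecutor


class JobStore:
    """In-memory job registry: submit work, then poll by job id."""

    def __init__(self, max_workers: int = 2):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs = {}
        self._lock = threading.Lock()

    def submit(self, fn, *args) -> str:
        """Register the job and return its id immediately (POST /jobs)."""
        job_id = str(uuid.uuid4())
        with self._lock:
            self._jobs[job_id] = {"status": "queued", "result": None, "error": None}
        self._pool.submit(self._run, job_id, fn, args)
        return job_id

    def get(self, job_id: str):
        """Return a snapshot of the job, or None if unknown (GET /jobs/{id})."""
        with self._lock:
            job = self._jobs.get(job_id)
            return dict(job) if job else None

    def _run(self, job_id, fn, args):
        self._update(job_id, status="running")
        try:
            result = fn(*args)
        except Exception as exc:  # surface the failure through the job record
            self._update(job_id, status="failed", error=str(exc))
        else:
            self._update(job_id, status="succeeded", result=result)

    def _update(self, job_id, **fields):
        with self._lock:
            self._jobs[job_id].update(fields)
```

A client would call `submit`, receive the id right away, and poll `get` until the status is `succeeded` or `failed`; a webhook variant would call back from `_run` instead.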
Error Design¶
Make these explicit:
- validation error
- rate limit
- upstream model unavailable
- policy refusal
- timeout
- partial failure
Do not collapse all AI problems into 500.
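A sketch of what explicit error semantics could look like. The HTTP status chosen for each category is an assumption made for illustration (teams reasonably differ, especially on policy refusals and partial failures); the part that matters is that each category has its own code and a `retryable` flag.

```python
from enum import Enum
from typing import Optional


class ErrorCode(str, Enum):
    VALIDATION = "validation_error"
    RATE_LIMITED = "rate_limited"
    UPSTREAM_UNAVAILABLE = "upstream_unavailable"
    POLICY_REFUSAL = "policy_refusal"
    TIMEOUT = "timeout"
    PARTIAL_FAILURE = "partial_failure"


# (HTTP status, retryable). Illustrative mapping, not a standard.
ERROR_POLICY = {
    ErrorCode.VALIDATION: (400, False),
    ErrorCode.RATE_LIMITED: (429, True),
    ErrorCode.UPSTREAM_UNAVAILABLE: (503, True),
    ErrorCode.POLICY_REFUSAL: (422, False),
    ErrorCode.TIMEOUT: (504, True),
    ErrorCode.PARTIAL_FAILURE: (207, False),
}


def error_response(code: ErrorCode, message: str, trace_id: str,
                   retry_after_s: Optional[int] = None):
    """Build (status, headers, body) for an error in a uniform envelope."""
    status, retryable = ERROR_POLICY[code]
    headers = {}
    if retryable and retry_after_s is not None:
        headers["Retry-After"] = str(retry_after_s)
    body = {
        "error": {
            "code": code.value,
            "message": message,
            "retryable": retryable,
            "trace_id": trace_id,
        }
    }
    return status, headers, body
```

With this shape, a client can decide whether to retry from the `retryable` flag alone, without parsing message strings.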
Versioning Strategy¶
Version:
- public API contract
- output schema
- major behavioral modes when necessary
Do not force clients to track every prompt revision.
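One possible way to keep prompt revisions internal is a release table that maps each public API version to whatever prompt and model currently back it. Everything named here (`Release`, `RELEASES`, the revision and model strings) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    schema_version: str    # visible to clients
    prompt_revision: str   # internal only
    model: str             # internal only


# Hypothetical values; prompt and model can change without a new API version
# as long as the output schema stays compatible.
RELEASES = {
    "v1": Release(schema_version="1.2", prompt_revision="summarize-r17", model="model-small"),
    "v2": Release(schema_version="2.0", prompt_revision="summarize-r23", model="model-small"),
}


def resolve(api_version: str) -> Release:
    """Look up the internal configuration behind a public API version."""
    try:
        return RELEASES[api_version]
    except KeyError:
        raise ValueError(f"unsupported API version: {api_version!r}") from None


def build_response(api_version: str, output: str) -> dict:
    """Expose only the contract-level versions, never the prompt revision."""
    release = resolve(api_version)
    return {
        "api_version": api_version,
        "schema_version": release.schema_version,
        "output": output,
    }
```

The prompt revision can still be logged against the trace id for debugging; it just does not become part of the client-facing contract.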
API Design Heuristics¶
- Keep provider-specific details behind the service boundary.
- Return stable structured outputs when downstream code depends on them.
- Add trace ids for debugging.
- Design for partial failure and retry.
- Make feedback collection easy.
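Designing for retry usually implies some form of idempotency, so a retried request does not trigger a second, billable model call. A minimal in-memory sketch, assuming the client sends a unique key per logical request; `IdempotencyCache` is a hypothetical name, and a real deployment would typically use a shared store with expiry.

```python
import threading


class IdempotencyCache:
    """Return the stored result when the same idempotency key is replayed."""

    def __init__(self):
        self._results = {}
        self._lock = threading.Lock()

    def run_once(self, key: str, fn):
        """Return (result, replayed). fn is skipped if key already completed."""
        with self._lock:
            if key in self._results:
                return self._results[key], True

        # Simplification: two concurrent first calls with the same key may
        # both run fn; only the first stored result is kept and returned.
        result = fn()
        with self._lock:
            stored = self._results.setdefault(key, result)
        return stored, False
```

Usage is one line in the handler: `result, replayed = cache.run_once(idempotency_key, lambda: call_model(payload))`, where `call_model` stands in for the actual model call.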
◆ Quick Reference¶
| Need | Better Design Choice |
|---|---|
| interactive chat | streaming endpoint |
| 10-minute document job | async job + polling or webhook |
| downstream automation | schema-first JSON response |
| model/provider churn | stable service contract over model-specific payloads |
| support debugging | trace ids and usage metadata |
○ Gotchas & Common Mistakes¶
- Raw provider payload passthrough creates long-term client pain.
- Streaming is not always better if the task is short and structured.
- Hidden prompt changes can look like API regressions to downstream teams.
- Weak error semantics make retry storms more likely.
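To make the last point concrete, here is a client-side sketch of retrying with exponential backoff and jitter. It assumes the server marks transient failures with 429, 503, or 504 and sends `Retry-After` as a number of seconds (the header can also be an HTTP date, which this sketch does not handle). The `send` callable is a hypothetical stand-in for the actual HTTP call.

```python
import random
import time

RETRYABLE_STATUS = {429, 503, 504}  # assumed transient statuses


def backoff_delay(attempt, base=0.5, cap=30.0, retry_after=None, rng=random.random):
    """Seconds to wait before the next attempt (attempt is 0-based)."""
    if retry_after is not None:
        return float(retry_after)  # trust the server's hint when present
    # "Full jitter": a random delay up to the exponential ceiling, so that
    # many clients recovering together do not retry in lockstep.
    return rng() * min(cap, base * 2 ** attempt)


def call_with_retries(send, max_attempts=5, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    send() must return (status, headers, body).
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status not in RETRYABLE_STATUS:
            return status, body
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt, retry_after=headers.get("Retry-After")))
    return status, body  # still failing: hand the last error to the caller
```

This only works if the server distinguishes retryable from non-retryable failures; a blanket 500 gives the client nothing to base the decision on.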
○ Interview Angles¶
- Q: What should an AI API return besides the answer?
- A: Usually a request id, status or finish reason, and optionally citations or usage metadata depending on the product. Those fields make debugging, billing, and trust much easier.
- Q: When would you choose an async job API?
- A: When the workflow is too long or variable for an interactive request, such as large document pipelines, multi-step agent tasks, or offline generation jobs.
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | AI System Design for GenAI Applications, Model Serving for LLM Applications |
| Leads to | integration engineering, conversational APIs, agent platforms |
| Compare with | provider-native APIs, traditional CRUD APIs |
| Cross-domain | backend engineering, API governance, DX |
★ Code & Implementation¶
FastAPI AI Endpoint with Streaming¶
# pip install "fastapi>=0.110" "uvicorn>=0.27" "openai>=1.0"
# ⚠️ Last tested: 2026-04 | Requires: fastapi>=0.110
import json
import time
import uuid

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()  # reads OPENAI_API_KEY from the environment


# Declared with plain `def`: the OpenAI client call is blocking, so FastAPI
# runs the handler in its threadpool instead of stalling the event loop.
@app.post("/v1/summarize")
def summarize(request: dict):
    """AI summarization endpoint with structured response."""
    trace_id = str(uuid.uuid4())
    start = time.time()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize: {request['text']}"}],
        max_tokens=200,
    )
    return {
        "summary": response.choices[0].message.content,
        "finish_reason": response.choices[0].finish_reason,
        "usage": {
            "input_tokens": response.usage.prompt_tokens,
            "output_tokens": response.usage.completion_tokens,
        },
        "trace_id": trace_id,
        "latency_ms": round((time.time() - start) * 1000),
    }


@app.post("/v1/chat/stream")
def chat_stream(request: dict):
    """Streaming chat endpoint via SSE."""

    def generate():
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=request["messages"],
            stream=True,
        )
        for chunk in stream:
            # Some chunks carry no choices or an empty delta; skip those.
            if chunk.choices and chunk.choices[0].delta.content:
                yield f"data: {json.dumps({'content': chunk.choices[0].delta.content})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")


# Expected: POST /v1/summarize returns structured JSON with trace_id
#           POST /v1/chat/stream returns SSE token stream
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Provider passthrough leak | Client breaks when model changes | Raw provider response exposed to clients | Wrap in stable response schema, version the contract |
| Retry storm | 10× traffic spike after transient failure | Client retries without backoff, server returns 500 for rate limits | Use 429 with Retry-After header, implement client-side exponential backoff |
| Streaming disconnect | User sees partial response, no error | Long SSE stream interrupted by proxy/LB timeout | Heartbeat pings, configurable timeouts, client reconnection logic |
| Schema drift | Downstream automation breaks silently | Prompt change alters output structure | Use structured output / JSON schema enforcement, version schemas |
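The heartbeat mitigation for streaming disconnects can be sketched as a wrapper around any token iterator. It emits an SSE comment line (a line starting with `:`, which SSE clients ignore) whenever no token has arrived within the interval, so idle proxies and load balancers keep the connection open. The function name and the 15-second default are assumptions.

```python
import json
import queue
import threading


def sse_with_heartbeat(tokens, heartbeat_s=15.0):
    """Yield SSE frames for each token, plus heartbeat comments while idle."""
    q = queue.Queue()
    done = object()  # sentinel marking the end of the token stream

    def pump():
        # Read the (possibly slow) upstream iterator on a background thread.
        try:
            for token in tokens:
                q.put(token)
        finally:
            q.put(done)

    threading.Thread(target=pump, daemon=True).start()

    while True:
        try:
            item = q.get(timeout=heartbeat_s)
        except queue.Empty:
            yield ": heartbeat\n\n"  # SSE comment, not delivered as an event
            continue
        if item is done:
            yield "data: [DONE]\n\n"
            return
        yield f"data: {json.dumps({'content': item})}\n\n"
```

In this sketch an upstream exception ends the stream with `[DONE]` as if it had finished normally; a fuller version would emit an explicit error event so the client can tell truncation from completion.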
◆ Hands-On Exercises¶
Exercise 1: Design an AI API Contract¶
Goal: Design a complete API contract for an AI-powered document extraction service
Time: 30 minutes
Steps:
1. Define request schema (input document, extraction fields, format preferences)
2. Define response schema (extracted fields, confidence scores, trace_id, usage)
3. Define error responses (validation, rate limit, policy refusal, timeout)
4. Add streaming endpoint for real-time extraction feedback
5. Document with OpenAPI spec
Expected Output: Complete OpenAPI spec covering sync, async, and streaming patterns
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 📘 Book | "AI Engineering" by Chip Huyen (2025), Ch 8 | API design patterns for AI-backed services |
| 🔧 Hands-on | OpenAI API Reference | Gold standard for AI API design — study their schema, streaming, and error patterns |
| 🔧 Hands-on | FastAPI Documentation | Best Python framework for building AI APIs with auto-documentation |
| 📘 Guide | Google API Design Guide | Industry-standard API design principles applicable to AI services |
★ Sources¶
- OpenAPI Specification — https://spec.openapis.org/
- Google API Design Guide — https://cloud.google.com/apis/design
- OpenAI API Reference — https://platform.openai.com/docs/api-reference
- AI System Design