LLM Routing & Model Selection

Bit: Using GPT-4 for everything is like taking an ambulance to the grocery store. Model routing sends simple requests to cheap/fast models and hard requests to powerful/expensive ones — cutting costs 60-80% with minimal quality loss.


★ TL;DR

  • What: Techniques for dynamically selecting which LLM handles each request based on task complexity, cost, latency, and quality requirements
  • Why: LLM prices vary by 100× or more between models (Gemini Flash: $0.075/M input tokens vs Claude Opus: $15/M input, $75/M output). Routing simple tasks to cheap models saves enormous money.
  • Key point: A well-tuned router sends 70-80% of traffic to cheap models, 15-25% to mid-tier, and < 5% to expensive models — reducing average cost per request by 5-10× while maintaining quality.

★ Overview

Definition

LLM routing is the practice of using a classifier, heuristic, or meta-model to select the optimal LLM for each incoming request based on task difficulty, cost constraints, latency requirements, and quality thresholds.

Scope

Covers: Routing strategies (rule-based, classifier-based, cascade), model selection frameworks, cost-quality tradeoff analysis, and production implementation. For broader cost optimization, see Cost Optimization. For model landscape, see LLM Landscape.

Significance

  • Cost is the #1 production AI concern: Most teams overspend by 5-10× using a single expensive model for all requests
  • Latency varies dramatically: Flash/Haiku-class models reach their first token in ~200ms; Opus/GPT-4-class models take 1-5 seconds
  • Interview staple: "How would you reduce LLM costs by 80% without losing quality?" tests this directly


★ Deep Dive

The Cost-Quality Spectrum (April 2026)

MODEL TIERS (per 1M input tokens):

  TIER 1: CHEAP & FAST              $0.075 - $0.25
    Gemini Flash, GPT-4o-mini, Claude Haiku
    Use for: classification, extraction, simple Q&A, reformatting
    Latency: 100-300ms TTFT (time to first token)

  TIER 2: MID-RANGE                 $1.00 - $5.00
    GPT-4o, Claude Sonnet, Gemini Pro
    Use for: complex Q&A, summarization, code generation
    Latency: 300-800ms TTFT

  TIER 3: POWERFUL & EXPENSIVE      $10.00 - $75.00
    Claude Opus, GPT-4 Turbo, o1-pro
    Use for: complex reasoning, hard code, multi-step analysis
    Latency: 1-5s TTFT

  COST DIFFERENCE: Tier 3 is up to 1000× more expensive than Tier 1

  QUALITY DIFFERENCE: For simple tasks, Tier 1 ≈ Tier 3 quality
                      For hard tasks, Tier 3 >> Tier 1 quality
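
To make the gap concrete, here is a back-of-envelope comparison for a request of ~1,000 input and ~1,000 output tokens. A minimal sketch: the input prices follow the tier table above, and the output prices are assumptions in the same ballpark, not quotes from any pricing page.

# Back-of-envelope cost per request at each tier (1,000 input / 1,000 output tokens).
# Prices are illustrative assumptions: input $/M from the tiers above, output $/M assumed.
PRICES = {  # tier: (input $/M tokens, output $/M tokens)
    "tier1_flash":  (0.075, 0.30),
    "tier2_sonnet": (3.00, 15.00),
    "tier3_opus":   (15.00, 75.00),
}

for tier, (inp, out) in PRICES.items():
    cost = (1_000 * inp + 1_000 * out) / 1_000_000
    print(f"{tier}: ${cost:.6f} per request")

# tier1_flash:  $0.000375 per request
# tier2_sonnet: $0.018000 per request
# tier3_opus:   $0.090000 per request  (240× the Tier 1 cost)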

Routing Strategies

┌────────────────────────────────────────────────────────────────┐
│                       ROUTING STRATEGIES                       │
│                                                                │
│  STRATEGY 1: RULE-BASED                                        │
│  ┌───────────────────────────────────────────────┐             │
│  │ if task_type == "classify":  use Haiku        │             │
│  │ if task_type == "summarize": use Sonnet       │             │
│  │ if task_type == "reason":    use Opus         │             │
│  └───────────────────────────────────────────────┘             │
│  ✅ Simple, predictable, no overhead                           │
│  ❌ Can't handle ambiguous requests                            │
│                                                                │
│  STRATEGY 2: CLASSIFIER-BASED                                  │
│  ┌───────────────────────────────────────────────┐             │
│  │ Train a small classifier on labeled examples  │             │
│  │ Input: user request → Output: model tier      │             │
│  │ Use: logistic regression, small BERT, or LLM  │             │
│  └───────────────────────────────────────────────┘             │
│  ✅ Handles nuance, data-driven                                │
│  ❌ Needs labeled data, can misroute                           │
│                                                                │
│  STRATEGY 3: CASCADE (TRY CHEAP FIRST)                         │
│  ┌───────────────────────────────────────────────┐             │
│  │ 1. Send to cheap model                        │             │
│  │ 2. Check confidence / quality score           │             │
│  │ 3. If below threshold → escalate to expensive │             │
│  └───────────────────────────────────────────────┘             │
│  ✅ No misrouting (always has fallback)                        │
│  ❌ Adds latency for escalated requests                        │
│                                                                │
│  STRATEGY 4: LLM-AS-ROUTER                                     │
│  ┌───────────────────────────────────────────────┐             │
│  │ Use a cheap LLM to classify difficulty:       │             │
│  │ "Rate this query 1-3 for complexity"          │             │
│  │ Route based on the rating                     │             │
│  └───────────────────────────────────────────────┘             │
│  ✅ Flexible, understands context                              │
│  ❌ Adds cost and latency for the routing call                 │
└────────────────────────────────────────────────────────────────┘
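
Strategy 1 in code is just a dispatch table. A minimal sketch, where the model names are illustrative and task_type is assumed to come from your API surface (e.g., one endpoint per task):

# Rule-based routing: a lookup table keyed on the known task type.
ROUTES = {
    "classify":  "claude-haiku",   # cheap tier
    "extract":   "claude-haiku",
    "summarize": "claude-sonnet",  # mid tier
    "reason":    "claude-opus",    # expensive tier
}

def pick_model(task_type: str) -> str:
    """Route known task types; default to mid-tier for anything unrecognized."""
    return ROUTES.get(task_type, "claude-sonnet")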

When to Use Each Strategy

Strategy        Best when                                         Overhead                 Accuracy
Rule-based      Known task types (API with defined endpoints)     Zero                     Medium
Classifier      High traffic, labeled historical data available   ~5ms                     High
Cascade         Quality is critical, can't afford misrouting      +200ms for escalations   Highest
LLM-as-router   Diverse queries, no labeled data yet              +100-200ms, +$0.0001     High
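
For Strategy 2, a lightweight classifier keeps routing overhead in the low-millisecond range. A minimal sketch using TF-IDF plus logistic regression; the labeled examples are hypothetical stand-ins for (query, tier) pairs you would mine from production logs:

# Classifier-based routing: local inference, no API call per decision.
# pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled examples -- in production, mine these from routing logs
queries = [
    "What is the capital of France?",
    "Format this date as ISO 8601",
    "Summarize this meeting transcript in three bullets",
    "Write a Python function to merge two sorted lists",
    "Design a distributed cache with failover and consistency guarantees",
    "Prove this algorithm's worst-case complexity bound",
]
labels = ["cheap", "cheap", "mid", "mid", "expensive", "expensive"]

router = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
router.fit(queries, labels)

def route(query: str) -> str:
    """Return a model tier in ~1-5ms -- pure local inference, no LLM call."""
    return router.predict([query])[0]

In practice you would want hundreds of labeled examples per tier; the pipeline shape stays the same.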

★ Code & Implementation

LLM-as-Router with Cost Tracking

# pip install openai>=1.0
# ⚠️ Last tested: 2026-04 | Requires: openai>=1.0

from openai import OpenAI
import json, time

client = OpenAI()

# Model tier configuration
MODELS = {
    "cheap": {"name": "gpt-4o-mini", "input_cost": 0.15, "output_cost": 0.60},
    "mid":   {"name": "gpt-4o",      "input_cost": 2.50, "output_cost": 10.00},
    "expensive": {"name": "gpt-4-turbo", "input_cost": 10.0, "output_cost": 30.0},
}

def classify_difficulty(query: str) -> str:
    """Use cheap LLM to classify query difficulty."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": """Classify the difficulty of this user query.
Output JSON: {"difficulty": "easy"|"medium"|"hard", "reason": "brief explanation"}

easy: simple facts, formatting, classification, short answers
medium: summarization, code generation, multi-step reasoning
hard: complex analysis, mathematical proofs, novel code architecture"""
        }, {
            "role": "user",
            "content": query,
        }],
        response_format={"type": "json_object"},
        temperature=0,
        max_tokens=100,
    )
    result = json.loads(response.choices[0].message.content)
    return result["difficulty"]

def route_and_respond(query: str, messages: list[dict] | None = None) -> dict:
    """Route query to appropriate model and return response with metadata."""
    start = time.time()

    # Step 1: Classify difficulty with the cheap router model
    difficulty = classify_difficulty(query)
    tier_map = {"easy": "cheap", "medium": "mid", "hard": "expensive"}
    tier = tier_map.get(difficulty, "mid")  # default to mid-tier on unexpected labels
    model_config = MODELS[tier]

    # Step 2: Generate response with selected model
    msgs = messages or [{"role": "user", "content": query}]
    response = client.chat.completions.create(
        model=model_config["name"],
        messages=msgs,
    )

    # Step 3: Calculate cost
    usage = response.usage
    cost = (
        usage.prompt_tokens * model_config["input_cost"] / 1_000_000 +
        usage.completion_tokens * model_config["output_cost"] / 1_000_000
    )

    return {
        "content": response.choices[0].message.content,
        "model_used": model_config["name"],
        "difficulty": difficulty,
        "cost_usd": f"${cost:.6f}",
        "latency_ms": round((time.time() - start) * 1000),
    }

# Test
print(route_and_respond("What is 2+2?"))
# Expected (costs approximate): {"model_used": "gpt-4o-mini", "difficulty": "easy", "cost_usd": "$0.000030", ...}

print(route_and_respond("Write an async Python web scraper with rate limiting and retry logic"))
# Expected (costs approximate): {"model_used": "gpt-4o", "difficulty": "medium", "cost_usd": "$0.005000", ...}

print(route_and_respond("Prove that P ≠ NP or explain the key obstacles to a proof"))
# Expected (costs approximate): {"model_used": "gpt-4-turbo", "difficulty": "hard", "cost_usd": "$0.010000", ...}
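
The cascade strategy (Strategy 3) deserves its own sketch. This reuses client and MODELS from the block above; the self-reported confidence prompt and the 0.7 threshold are illustrative assumptions that should be tuned on labeled data:

# Cascade routing sketch: try the cheap model first, escalate only on low confidence.
def cascade_respond(query: str, confidence_threshold: float = 0.7) -> dict:
    """Try the cheap model; escalate to the expensive tier if confidence is low."""
    cheap = client.chat.completions.create(
        model=MODELS["cheap"]["name"],
        messages=[{
            "role": "system",
            "content": "Answer the user. Then output a final line "
                       "'CONFIDENCE: <0.0-1.0>' estimating how likely "
                       "your answer is correct.",
        }, {"role": "user", "content": query}],
    )
    text = cheap.choices[0].message.content
    confidence = 0.0
    for line in text.splitlines():
        if line.startswith("CONFIDENCE:"):
            try:
                confidence = float(line.split(":", 1)[1].strip())
            except ValueError:
                confidence = 0.0  # unparseable -> treat as low confidence
            text = text.replace(line, "").strip()

    if confidence >= confidence_threshold:
        return {"content": text,
                "model_used": MODELS["cheap"]["name"], "escalated": False}

    # Below threshold: discard the cheap answer and escalate
    strong = client.chat.completions.create(
        model=MODELS["expensive"]["name"],
        messages=[{"role": "user", "content": query}],
    )
    return {"content": strong.choices[0].message.content,
            "model_used": MODELS["expensive"]["name"], "escalated": True}

Note the tradeoff from the diagram: escalated requests pay both calls, so savings depend on keeping the escalation rate low.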

◆ Quick Reference

ROUTING DECISION GUIDE:

  Known task types (APIs)?          → Rule-based routing
  High traffic + labeled data?      → Train a classifier
  Quality-critical + can't misroute? → Cascade (try cheap first)
  Cold start / no labeled data?     → LLM-as-router

COST SAVINGS ESTIMATES:
  No routing (all GPT-4):           $100/day baseline
  Rule-based routing:               $30-50/day (50-70% savings)
  Classifier routing:               $15-25/day (75-85% savings)
  Cascade with confidence:          $20-30/day (70-80% savings)

TRAFFIC DISTRIBUTION TARGET:
  Tier 1 (cheap):    70-80% of requests
  Tier 2 (mid):      15-25% of requests
  Tier 3 (expensive): 2-5% of requests

◆ Production Failure Modes

Failure: Under-routing
  Symptoms:   Hard queries sent to the cheap model; quality drops
  Root cause: Router classifier too aggressive on cost
  Mitigation: Monitor quality per route; set minimum quality thresholds

Failure: Over-routing
  Symptoms:   Most traffic goes to the expensive model; costs stay high
  Root cause: Router too conservative, scared of quality drops
  Mitigation: Track routing distribution; set a target % per tier (see the sketch below)

Failure: Router latency
  Symptoms:   200ms+ added to every request by routing overhead
  Root cause: LLM-as-router is slow, or the classifier model is large
  Mitigation: Use a lightweight classifier (~5ms); cache routing decisions

Failure: Cascade cost explosion
  Symptoms:   Escalation rate too high, negating the savings
  Root cause: Confidence threshold too strict, or the cheap model underperforms
  Mitigation: Tune the threshold on labeled data; improve the cheap-model prompt
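
A minimal sketch of the "track routing distribution" mitigation: log each routing decision and warn when the live mix drifts outside target bands. The bands mirror the Quick Reference targets and are assumptions to adjust for your own traffic:

# Routing-distribution monitor: warn when live traffic drifts from target bands.
from collections import Counter

TARGETS = {"cheap": (0.70, 0.80), "mid": (0.15, 0.25), "expensive": (0.02, 0.05)}
decisions = Counter()  # populated by the router, e.g. decisions[tier] += 1

def check_distribution(decisions: Counter) -> list[str]:
    """Return warnings for tiers whose traffic share is outside its target band."""
    total = sum(decisions.values()) or 1
    warnings = []
    for tier, (lo, hi) in TARGETS.items():
        share = decisions[tier] / total
        if not lo <= share <= hi:
            warnings.append(f"{tier}: {share:.0%} outside target {lo:.0%}-{hi:.0%}")
    return warnings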

○ Interview Angles

  • Q: How would you reduce LLM costs by 80% without losing quality?
  • A: Model routing. I'd analyze our traffic and find that 70-80% of requests are simple (classification, extraction, formatting) and can be handled by a cheap model like GPT-4o-mini or Gemini Flash at 1/100th the cost of GPT-4. I'd implement a classifier-based router trained on labeled examples of easy/medium/hard queries. For the remaining 20-30% of complex requests, I'd use a mid-tier model, reserving expensive models (GPT-4, Opus) for only the hardest 2-5%. I'd monitor quality per route with automated evals and adjust thresholds weekly. Expected savings: 5-10× reduction in average cost per request.

◆ Hands-On Exercises

Exercise 1: Build a Cost-Optimizing Router

Goal: Implement model routing that reduces costs by 5×
Time: 45 minutes
Steps:
  1. Collect 50 example queries spanning easy/medium/hard
  2. Implement the LLM-as-router from the code section
  3. Process all 50 queries, measuring cost per request for each model tier
  4. Compare all-GPT-4 cost vs routed cost and calculate the savings (see the sketch below)
Expected Output: Cost comparison table showing 5-10× savings with routing
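
A sketch of the step-4 comparison, assuming you have logged each query as a (difficulty, prompt_tokens, completion_tokens) tuple and reusing MODELS from the implementation section; the all-GPT-4 baseline is priced at the expensive tier:

# Exercise harness: total cost of routed traffic vs an all-expensive baseline.
def compare_costs(results: list[tuple[str, int, int]]) -> None:
    tier_map = {"easy": "cheap", "medium": "mid", "hard": "expensive"}

    def cost(tier: str, prompt_toks: int, completion_toks: int) -> float:
        m = MODELS[tier]
        return (prompt_toks * m["input_cost"]
                + completion_toks * m["output_cost"]) / 1_000_000

    routed = sum(cost(tier_map[d], p, c) for d, p, c in results)
    baseline = sum(cost("expensive", p, c) for _, p, c in results)
    print(f"all-GPT-4: ${baseline:.4f} | routed: ${routed:.4f} "
          f"({baseline / routed:.1f}× savings)")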


★ Connections

Relationship    Topics
Builds on       Cost Optimization, LLM Landscape, Model Serving
Leads to        AI platform engineering, autonomous cost management
Compare with    Static model selection, manual A/B testing
Cross-domain    Load balancing, API gateway routing, CDN edge logic

Type          Resource                                       Why
📄 Paper      Ding et al., "RouteLLM" (2024)                 Academic approach to cost-aware LLM routing
🔧 Hands-on   Martian Router                                 Commercial LLM routing service
🔧 Hands-on   Artificial Analysis                            Compare model speed/cost/quality for routing decisions
📘 Book       "AI Engineering" by Chip Huyen (2025), Ch. 9   Cost-aware architecture patterns including routing

★ Sources

  • Ding et al. "RouteLLM: Learning to Route LLMs with Preference Data" (2024)
  • OpenAI, Anthropic, Google AI pricing pages (April 2026)
  • Cost Optimization
  • LLM Landscape