LLM Landscape & Model Selection (April 2026)¶
✨ Bit: In 2023, GPT-4 was the only frontier model. By April 2026, there are 6+ frontier providers, each with 5+ model variants. Choosing the right model is now a genuine engineering decision, not just "use GPT."
★ TL;DR¶
- What: A comparison of current frontier LLMs and guidance for selecting the right model
- Why: Interviewers ask "which model would you choose for X?" and "open vs closed?" — you need specifics, not generalities
- Key point: There's no single best model. GPT-5.4 for general tasks, Claude Opus 4.6 for long code, Gemini 3.1 for multimodal, LLaMA 4 for self-hosting. Model selection is about TRADEOFFS.
★ Overview¶
Definition¶
This note is a time-sensitive snapshot of the frontier and near-frontier LLM market plus a framework for choosing models based on workload constraints.
Scope¶
Covers major providers, open vs closed trade-offs, and practical selection heuristics. Verify the latest model lineup with vendor release notes before using this note for a current purchasing or architecture decision.
Market snapshot: April 2026. Last verified: 2026-04.
Significance¶
- Model choice affects quality, latency, privacy posture, and cost more than most teams expect.
- Engineers are increasingly expected to explain why a given model fits a workload instead of defaulting to one provider.
★ Deep Dive¶
The Frontier Models (April 2026)¶
OpenAI — GPT-5 Family¶
| Model | Released | Context | Best For |
|---|---|---|---|
| GPT-5.4 | Mar 5, 2026 | 1M tokens | General work: spreadsheets, presentations, tool use |
| GPT-5.4 Thinking | Mar 2026 | 1M | Analytical tasks, shows reasoning steps |
| GPT-5.4 Pro | Mar 2026 | 1M | Highest accuracy (slower, expensive) |
| GPT-5.4 mini | Mar 17, 2026 | 1M | High-volume, cost-efficient, near GPT-5.4 quality |
| GPT-5.4 nano | Mar 17, 2026 | — | Cheapest, fastest: classification, extraction |
| GPT-5.3 Instant | Mar 3, 2026 | — | Rapid conversational responses |
| GPT-5.3-Codex | Feb 5, 2026 | — | Coding agent (Copilot default) |
| GPT-5.4-Cyber | Apr 14, 2026 | 1M | Defensive cybersecurity (limited access via TAC program) |
Google — Gemini 3 Family¶
| Model | Released | Context | Best For |
|---|---|---|---|
| Gemini 3.1 Pro | Feb 19, 2026 | 1M+ | Advanced reasoning (3-tier thinking), multimodal |
| Gemini 3.1 Flash-Lite | Mar 3, 2026 | — | Cost-efficient, high throughput, Pro-level quality |
| Gemini 3.1 Deep Think | 2026 | — | Complex technical problems (AI Ultra subscribers) |
| Gemini 3.1 Flash Image | Feb 26, 2026 | — | High-efficiency image generation |
| Gemini 3.1 Flash Live | Mar 26, 2026 | — | Real-time audio-to-audio, powers Search Live |
Available in: Gemini API, AI Studio, Gemini CLI, Antigravity, Vertex AI, NotebookLM
Anthropic — Claude 4.x Family¶
| Model | Released | Context | Best For |
|---|---|---|---|
| Claude Opus 4.6 | Feb 5, 2026 | 1M tokens | Most capable: code, analysis, long-doc reasoning |
| Claude Sonnet 4.6 | Feb 17, 2026 | 1M tokens | Balanced: default for claude.ai, Claude Cowork |
| Claude Mythos (Preview) | Apr 7, 2026 | — | Gated research preview (~50 orgs, Project Glasswing, defensive cyber) |
Meta — LLaMA 4 Family (Open-Source)¶
| Model | Params | Context | Best For |
|---|---|---|---|
| LLaMA 4 Scout | 17B active / 109B total (MoE, 16 experts) | 10M tokens | Efficiency, single H100, huge context |
| LLaMA 4 Maverick | 17B active / 400B total (MoE, 128 experts) | 1M tokens | Performance, multimodal, open-weight flagship |
| LLaMA 4 Behemoth | 288B active / 2T total | — | Teacher model (still training, not released) |
Meta 2026 roadmap: Mango (generative video) + Avocado (reasoning LLM)
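The Scout/Maverick rows illustrate the core MoE trade-off: total parameters set the memory footprint, active parameters set per-token compute. A back-of-envelope check of the ratios in the table above (memory figures assume dense weight storage at the stated precision and ignore KV cache, activations, and runtime overhead):

```python
# Back-of-envelope MoE sizing for the LLaMA 4 rows above.

def moe_summary(active_b: float, total_b: float, bytes_per_param: float):
    """Return (fraction of weights active per token, weight memory in GB)."""
    # total_b billions of params * bytes/param == GB of weight storage
    return active_b / total_b, total_b * bytes_per_param

# Scout: 17B active / 109B total
ratio, gb_int8 = moe_summary(17, 109, 1.0)   # int8: 1 byte/param
_, gb_int4 = moe_summary(17, 109, 0.5)       # int4: 0.5 byte/param
print(f"Scout: {ratio:.1%} of weights active per token")        # ~15.6%
print(f"Scout weights: ~{gb_int8:.0f} GB int8, ~{gb_int4:.0f} GB int4")
# int4 (~55 GB) is what makes "fits a single 80 GB H100" plausible.

ratio, gb = moe_summary(17, 400, 1.0)        # Maverick: 17B / 400B
print(f"Maverick: {ratio:.1%} active, ~{gb:.0f} GB int8")
```

Note the asymmetry: Maverick runs the same 17B-per-token compute as Scout but needs roughly 4x the memory to hold all experts.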
Other Notable Models¶
| Model | By | Key Feature |
|---|---|---|
| DeepSeek-V3 / R1 | DeepSeek | Cost-efficient reasoning, GRPO training, open-source |
| Gemma 4 (E2B/E4B) | Google DeepMind | Ultra-efficient; audio input; April 2, 2026 |
| Gemma 4 (26B MoE) | Google DeepMind | Multimodal (vision+video+audio); hybrid attention; 256K context |
| Gemma 4 (31B Dense) | Google DeepMind | Dense flagship; thinking mode; 256K context; best open-weight reasoning |
| Mistral Large 2 | Mistral AI | European, strong multilingual, 128K context |
| Qwen 2.5 | Alibaba | Leading Chinese LLM, strong code + math |
| Grok-3 | xAI | Real-time X/Twitter data, humor-capable |
Gemma 4 Family (April 2, 2026) — Architecture Deep Dive¶
Gemma 4 represents a major architectural shift from Gemma 3. Key innovations:
| Variant | Params | Context | Modalities | Key Feature |
|---|---|---|---|---|
| Gemma 4 E2B | 2B | 64K | Text + Audio | Embedded; mobile/edge; audio understanding |
| Gemma 4 E4B | 4B | 64K | Text + Audio | Embedded; improved reasoning over E2B |
| Gemma 4 26B | 26B (MoE) | 256K | Text + Vision + Video + Audio | Hybrid attention (local+global); sharing KV cache across heads |
| Gemma 4 31B | 31B (Dense) | 256K | Text + Vision + Video + Audio | Thinking mode; best open-weight reasoning; 2026 benchmark leader |
Architecture innovations:
- Hybrid Attention: alternates local (sliding-window) and global (full) attention layers for O(n) cost on local layers plus full context at key positions
- Dual RoPE: separate positional encodings for local vs global attention layers
- PLE (Per-Layer Embeddings): distinct learned embeddings per decoder layer instead of one set tied across all layers
- Shared KV Cache: multiple attention heads share the same key-value cache, reducing KV memory by 3-4x vs standard MHA
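The shared-KV-cache saving is simple arithmetic to sanity-check. A sketch at the 256K context from the table; the layer count, head count, head dimension, and the 4-heads-per-KV-pair grouping are illustrative assumptions, not published Gemma 4 shapes:

```python
# KV-cache sizing: standard MHA (one K/V pair per head) vs a shared KV
# cache where several query heads read the same K/V pair.
# All model dimensions below are illustrative assumptions.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # factor of 2 for K and V; fp16 = 2 bytes per element
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

SEQ = 256_000                        # 256K context, as in the 26B/31B rows
LAYERS, HEADS, HDIM = 48, 32, 128    # hypothetical model shapes

mha = kv_cache_bytes(LAYERS, HEADS, HDIM, SEQ)           # one KV pair per head
shared = kv_cache_bytes(LAYERS, HEADS // 4, HDIM, SEQ)   # 4 heads share one pair

print(f"MHA KV cache:    {mha / 2**30:.1f} GiB")
print(f"Shared KV cache: {shared / 2**30:.1f} GiB ({mha / shared:.0f}x smaller)")
```

With 4 query heads per shared KV pair the cache shrinks exactly 4x, consistent with the 3-4x figure claimed above; the exact factor depends on the grouping ratio the model actually uses.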
Available via: Google AI Studio, Vertex AI, Ollama (ollama pull gemma4), Hugging Face
By Capability (April 2026)¶
| Capability | Best Model(s) | Runner-Up |
|---|---|---|
| General intelligence | GPT-5.4, Claude Opus 4.6 | Gemini 3.1 Pro |
| Coding | GPT-5.3-Codex, Claude Opus 4.6 | Gemini 3.1 Pro |
| Reasoning / Math | GPT-5.4 Thinking, Gemini 3.1 Deep Think | DeepSeek-R1 |
| Multimodal (vision) | Gemini 3.1 Pro (native) | GPT-5.4 |
| Long document | Claude Opus 4.6 (1M, reliable) | LLaMA 4 Scout (10M) |
| Cost efficiency | GPT-5.4 mini/nano, Gemini Flash-Lite | LLaMA 4 Scout (self-host) |
| Open-source | LLaMA 4 Maverick | DeepSeek-V3, Qwen 2.5 |
| Self-hosting | LLaMA 4 Scout (1 GPU!), Qwen | Mistral |
Open vs Closed Models¶
CLOSED SOURCE (API-only): GPT-5.4, Claude 4.6, Gemini 3.1
- ✅ Highest capability
- ✅ Managed, zero-ops
- ✅ Continuously updated
- ✅ Safety/alignment built in
- ❌ Data leaves your infra
- ❌ Vendor lock-in
- ❌ Costs scale linearly
- ❌ Can't fine-tune deeply

OPEN SOURCE / WEIGHTS: LLaMA 4, DeepSeek, Qwen, Mistral
- ✅ Full control over deployment
- ✅ No vendor lock-in
- ✅ Fine-tunable (LoRA, full)
- ✅ Data stays on your infra
- ❌ You manage inference infra
- ❌ 6-12 months behind frontier
- ❌ Safety is YOUR responsibility
- ❌ Less multimodal capability
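"Costs scale linearly" vs "you manage inference infra" reduces to a break-even volume. A deliberately naive sketch; every dollar figure and throughput number here is a hypothetical placeholder, not a vendor quote:

```python
# Break-even sketch: per-token API pricing vs fixed-cost self-hosting.
# All prices and throughput figures are hypothetical -- check current
# vendor pricing and your own measured tokens/sec before deciding.

API_COST_PER_1M_TOKENS = 0.50      # $ per 1M tokens (hypothetical mini-tier)
GPU_MONTHLY_COST = 2500.0          # $ per month for one rented GPU (hypothetical)
SELF_HOST_CAPACITY = 10_000_000_000  # tokens/month one GPU can sustain (assumed)

def monthly_api_cost(tokens: float) -> float:
    return tokens / 1e6 * API_COST_PER_1M_TOKENS

def cheaper_to_self_host(tokens: float) -> bool:
    # Ignores ops headcount, redundancy, and egress -- deliberately naive.
    return monthly_api_cost(tokens) > GPU_MONTHLY_COST and tokens <= SELF_HOST_CAPACITY

break_even = GPU_MONTHLY_COST / API_COST_PER_1M_TOKENS * 1e6
print(f"Break-even: {break_even / 1e9:.1f}B tokens/month")
```

The real decision also prices in the engineer-hours behind "you manage inference infra", which this sketch deliberately omits.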
◆ Model Selection Decision Tree¶
START: What's your use case?
│
├── Simple classification, extraction, routing
│ → GPT-5.4 nano or mini (cheapest, fastest)
│
├── General chat / customer support
│ → Claude Sonnet 4.6 or GPT-5.4 (balanced)
│
├── Complex coding / large codebase
│ → Claude Opus 4.6 (best reasoning for code)
│ → GPT-5.3-Codex (if using Copilot)
│
├── Math / science / reasoning
│ → GPT-5.4 Thinking or Gemini 3.1 Deep Think
│
├── Multimodal (images, video, audio input)
│ → Gemini 3.1 Pro (native multimodal, best)
│
├── Data must stay on-premise / regulated industry
│ → LLaMA 4 Scout/Maverick (self-host)
│ → Qwen 2.5 (if Asian market)
│
├── High volume / cost-sensitive
│ → GPT-5.4 nano < mini < Gemini Flash-Lite
│ → Self-host LLaMA 4 Scout (1 GPU, 10M context!)
│
└── Research / experimental
→ DeepSeek (cost-efficient frontier)
→ Multiple models (benchmark on YOUR data)
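The tree above flattens naturally into a routing table plus one compliance override. A minimal sketch; the use-case labels are mine and the model strings echo the tree's recommendations, so swap in whatever IDs your providers actually expose:

```python
# The decision tree above as a lookup. Labels and model IDs are
# illustrative; adapt them to your providers' real model strings.

ROUTING = {
    "classification": "gpt-5.4-nano",       # simple extraction/routing
    "chat":           "claude-sonnet-4-6",  # general chat / support
    "coding":         "claude-opus-4-6",    # complex / large-codebase work
    "reasoning":      "gpt-5.4-thinking",   # math / science
    "multimodal":     "gemini-3.1-pro",     # image/video/audio input
    "on_premise":     "llama-4-scout",      # regulated / self-hosted
    "high_volume":    "gpt-5.4-mini",       # cost-sensitive
}

def pick_model(use_case: str, must_self_host: bool = False) -> str:
    # Compliance constraints override everything else in the tree.
    if must_self_host:
        return ROUTING["on_premise"]
    return ROUTING.get(use_case, "gpt-5.4")  # general-purpose default

print(pick_model("coding"))                       # claude-opus-4-6
print(pick_model("coding", must_self_host=True))  # llama-4-scout
```

Putting the table in config rather than code makes the "benchmark on YOUR data" branch cheap: re-run your eval, update the table, ship.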
○ Interview Angles¶
- Q: Which LLM would you choose for a production RAG system?
- A: Depends on constraints. For highest quality: Claude Opus 4.6 (1M context, best at following complex instructions with citations). For cost efficiency: GPT-5.4 mini (near GPT-5.4 quality at a fraction of the cost). For data privacy: LLaMA 4 Scout self-hosted (10M context, fits on one H100). For multimodal RAG: Gemini 3.1 Pro (native vision for image documents). In practice, use a cheaper model for retrieval/routing and a powerful model for generation.
- Q: Open source vs closed source: when to use which?
- A: Closed (GPT-5.4, Claude) when you need cutting-edge capability, have budget, want zero-ops, and your data policies allow API calls. Open (LLaMA 4, Gemma 4, DeepSeek) when data must stay on-premise (healthcare, finance, government), you need fine-tuning beyond what APIs allow, or cost at scale is prohibitive. Trend in 2026: Gemma 4 31B and LLaMA 4 Scout are competitive with mid-tier closed models while being fully self-hostable.
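The "cheaper model for retrieval/routing, powerful model for generation" pattern from the RAG answer can be sketched provider-agnostically. Here the classifier and generators are plain callables standing in for real API clients, and the length-based toy classifier exists only so the sketch runs offline:

```python
# Two-tier dispatch: a cheap classifier decides whether the expensive
# model is needed. The callables are stand-ins for real API clients.

def answer(query, classify_cheap, generate_cheap, generate_strong):
    """classify_cheap returns 'simple' or 'complex' for a query."""
    label = classify_cheap(query)
    model = generate_strong if label == "complex" else generate_cheap
    return model(query)

# Toy stand-in: route on query length instead of a real LLM call.
cheap_clf = lambda q: "complex" if len(q.split()) > 8 else "simple"

out = answer("What is RAG?", cheap_clf,
             lambda q: f"[mini] {q}", lambda q: f"[opus] {q}")
print(out)  # [mini] What is RAG?
```

In production the classifier would itself be a nano/mini-tier call (or a small fine-tuned model), amortized across every request it keeps off the expensive tier.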
★ Code & Implementation¶
Multi-Provider LLM API Comparison¶
# pip install "openai>=1.60" "anthropic>=0.40" "google-generativeai>=0.8"
# ⚠️ Last tested: 2026-04 | Requires: openai>=1.60, anthropic>=0.40, google-generativeai>=0.8
# Set: OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_API_KEY env vars
import os
from openai import OpenAI
import anthropic
import google.generativeai as genai
prompt = "Explain the transformer attention mechanism in 3 sentences."
# OpenAI
oai_client = OpenAI()
oai = oai_client.chat.completions.create(
model="gpt-4o-mini", # Replace with gpt-5.4 for frontier
messages=[{"role": "user", "content": prompt}],
max_tokens=200,
)
print("OpenAI:", oai.choices[0].message.content[:100])
# Anthropic
ant_client = anthropic.Anthropic()
ant = ant_client.messages.create(
model="claude-3-5-haiku-20241022", # Replace with claude-sonnet-4-6 for frontier
max_tokens=200,
messages=[{"role": "user", "content": prompt}],
)
print("Anthropic:", ant.content[0].text[:100])
# Google Gemini
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
gem = genai.GenerativeModel("gemini-2.0-flash") # Replace with gemini-3.1-pro for frontier
res = gem.generate_content(prompt)
print("Gemini:", res.text[:100])
# Self-hosted Gemma 4 via Ollama (free, no API key):
# ollama pull gemma4 (downloads ~20GB for the 26B MoE variant)
# curl http://localhost:11434/api/generate -d '{"model": "gemma4", "prompt": "..."}'
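The curl call above translates to a small stdlib-only helper. The payload fields (`model`, `prompt`, `stream`) follow the Ollama REST API for `/api/generate`; `stream=False` asks for a single JSON object, whose `response` field holds the generated text:

```python
# Minimal Ollama client sketch using only the standard library.
# Assumes a local Ollama server on the default port (11434).
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, prompt: str) -> bytes:
    # stream=False -> one JSON object instead of newline-delimited chunks
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(model: str, prompt: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Needs a running server with the model pulled:
# print(generate("gemma4", "Explain attention in one sentence."))
```

Because there is no API key or per-token bill, this is the cheapest way to put an open-weight model behind the same call shape as the hosted providers above.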
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | Llms Overview, Reasoning Models & Test-Time Compute |
| Leads to | Inference Optimization, Cost Optimization for GenAI Systems, model-routing decisions |
| Compare with | open-weight deployment choices, provider API strategy |
| Cross-domain | procurement, architecture, platform strategy |
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Model selection bias | Team always picks the largest/newest model | No structured evaluation against requirements | Decision matrix: cost, latency, quality, compliance constraints |
| API deprecation | Production breaks when provider sunsets model version | No model version pinning or migration plan | Pin versions, monitor deprecation notices, abstract provider |
| Benchmark ≠ production | Top benchmark model underperforms on your task | Benchmarks don't represent your distribution | Custom eval on your data before committing |
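The API-deprecation mitigation ("pin versions, abstract provider") can be as light as a config-driven alias layer, so a provider sunset becomes a one-line config change instead of a code hunt. A sketch; the aliases and dated version strings are illustrative, not real release IDs:

```python
# Pin dated model versions behind stable aliases. The version strings
# below are illustrative placeholders, not real provider release IDs.

PINNED_MODELS = {
    # alias           (provider,    pinned version string)
    "default-chat":  ("openai",    "gpt-5.4-2026-03-05"),
    "long-context":  ("anthropic", "claude-opus-4-6-20260205"),
    "cheap-extract": ("openai",    "gpt-5.4-nano-2026-03-17"),
}

def resolve(alias: str) -> tuple:
    """Map a stable application-level alias to (provider, pinned model)."""
    if alias not in PINNED_MODELS:
        raise KeyError(f"unknown model alias: {alias!r}")
    return PINNED_MODELS[alias]

provider, model = resolve("default-chat")
print(provider, model)
```

Application code only ever names aliases; when a provider announces a sunset, you update one table entry, re-run your eval suite against the new pin, and deploy.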
◆ Hands-On Exercises¶
Exercise 1: Build a Model Selection Matrix¶
Goal: Evaluate 3 models for a specific production use case
Time: 30 minutes
Steps:
1. Define 5 evaluation criteria (quality, latency, cost, context length, safety)
2. Run 20 representative queries through GPT-4o, Claude Sonnet, and Gemini Flash
3. Score each model on each criterion (1-5)
4. Calculate weighted scores and recommend a model
Expected Output: Decision matrix with recommendation and rationale
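The weighted-scores step of the exercise is just a dot product. A sketch; the criterion weights and the per-model 1-5 scores below are made-up examples to show the mechanics, not benchmark results:

```python
# Weighted decision matrix for Exercise 1. Weights and scores are
# made-up examples -- replace them with your own eval results.

WEIGHTS = {"quality": 0.35, "latency": 0.15, "cost": 0.25,
           "context": 0.15, "safety": 0.10}

SCORES = {  # 1-5 per criterion, from step 3 of the exercise
    "gpt-4o":        {"quality": 4, "latency": 4, "cost": 3, "context": 3, "safety": 4},
    "claude-sonnet": {"quality": 5, "latency": 3, "cost": 3, "context": 4, "safety": 5},
    "gemini-flash":  {"quality": 3, "latency": 5, "cost": 5, "context": 4, "safety": 3},
}

def weighted_score(scores: dict) -> float:
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)

ranked = sorted(SCORES, key=lambda m: weighted_score(SCORES[m]), reverse=True)
for m in ranked:
    print(f"{m}: {weighted_score(SCORES[m]):.2f}")
```

The weights are where the engineering judgment lives: a compliance-heavy workload might put 0.3 on safety alone, which can flip the ranking without a single score changing.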
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 🔧 Hands-on | LMSYS Chatbot Arena | Live human-evaluated model rankings |
| 🔧 Hands-on | Artificial Analysis | Speed, price, and quality comparisons across LLM providers |
| 📘 Book | "AI Engineering" by Chip Huyen (2025), Ch 2 | Model selection framework for practitioners |
★ Sources¶
- OpenAI model releases — https://openai.com/index
- Google DeepMind Gemini — https://deepmind.google/technologies/gemini/
- Anthropic Claude — https://anthropic.com
- Meta LLaMA 4 announcement (April 2025)
- Chatbot Arena / LMSYS leaderboard — https://chat.lmsys.org