Reasoning Models & Test-Time Compute¶
✨ Bit: Pre-2024: "Make the model bigger to make it smarter." Post-2024: "Make the model THINK LONGER to make it smarter." This shift — from scaling training compute to scaling inference compute — is the biggest paradigm change since the Transformer.
★ TL;DR¶
- What: LLMs that generate internal "thinking" chains before answering, trading more inference compute for dramatically better reasoning
- Why: Standard LLMs fail at complex math, logic, and multi-step problems. Reasoning models solve these by "thinking step by step" internally — not as a prompting trick, but as a trained capability
- Key point: o1/o3/DeepSeek-R1 represent a new scaling law — test-time compute scaling — where spending more compute at inference yields better answers, sometimes surpassing even larger standard models
★ Overview¶
Definition¶
Reasoning models are LLMs specifically trained (usually via reinforcement learning) to produce extended chains of internal reasoning before generating a final answer. Unlike standard LLMs that generate responses token-by-token in one pass, reasoning models generate a hidden "thinking" block where they plan, verify, backtrack, and self-correct.
Scope¶
Covers: Reasoning model architectures, test-time compute scaling, process reward models, and when to use reasoning vs standard models. For basic prompting techniques (zero-shot CoT), see Prompt Engineering. For standard LLM overview, see Llms Overview.
Last verified for model-lineup and timeline references: 2026-04.
Significance¶
- o1 (Sep 2024) proved reasoning models can solve PhD-level problems standard LLMs can't
- DeepSeek-R1 (Jan 2025) showed open-source reasoning is viable
- This is the #1 interview topic for frontier GenAI roles in 2025-2026
- Fundamentally changes when to use "bigger model" vs "more thinking time"
Prerequisites¶
- Llms Overview — how standard LLMs work
- Prompt Engineering — chain-of-thought prompting
- Probability And Statistics — reinforcement learning basics
★ Deep Dive¶
The Two Scaling Laws¶
ERA 1: PRE-TRAINING SCALING (2020-2024)
"Make the model bigger, train on more data"
Performance ∝ model_size × data_size × training_compute
GPT-3 (175B) → GPT-4 (~1.8T) → GPT-5 (~1T+)
Problem: Diminishing returns. Each 10x of training compute buys progressively smaller capability gains.
Cost: $100M+ per training run.
ERA 2: TEST-TIME COMPUTE SCALING (2024+)
"Let the model think longer on hard problems"
Performance ∝ inference_compute (thinking tokens)
Small model + a large thinking-token budget → can match or beat a larger model on reasoning
Cost: Pay per-problem (harder problems = more tokens = more cost)
KEY INSIGHT: You can DYNAMICALLY allocate compute per problem.
Easy question → fast answer. Hard math → 10 minutes of thinking.
THE SCALING CEILING: Test-time compute scaling is bounded by the model's ability to verify its own logic (the verification-generation gap). If verifying a step is as hard as generating it, inference scaling flatlines. Furthermore, massive thinking chains lead to KV cache exhaustion on GPUs.
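A minimal sketch of the per-problem allocation idea; the difficulty heuristic and the token budgets are illustrative assumptions, not any vendor's API:

# Sketch: dynamic test-time compute allocation (illustrative only).
def estimate_difficulty(prompt: str) -> float:
    """Toy heuristic: proxy difficulty via math/logic markers in the prompt."""
    markers = ("prove", "derive", "how many", "optimize", "debug", "plan")
    return min(1.0, sum(m in prompt.lower() for m in markers) / 3)

def thinking_budget(prompt: str) -> int:
    """Map difficulty to a thinking-token budget: easy = cheap, hard = expensive."""
    d = estimate_difficulty(prompt)
    if d < 0.3:
        return 0           # answer directly, no extended thinking
    if d < 0.7:
        return 2_000       # moderate reasoning
    return 20_000          # full test-time compute for a hard problem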
How Reasoning Models Work¶
STANDARD LLM:
User: "What is 27 × 34?"
Model: "918" ← Direct answer (often wrong for harder math)
Tokens: ~5
REASONING MODEL:
User: "What is 27 × 34?"
[THINKING — hidden from user]
"I need to multiply 27 × 34.
Let me break this down:
27 × 34 = 27 × 30 + 27 × 4
27 × 30 = 810
27 × 4 = 108
810 + 108 = 918
Let me verify: 918 / 27 = 34 ✓"
[/THINKING]
Model: "918" ← Same answer, but verified
Tokens: ~80 (thinking) + 5 (answer)
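Back-of-envelope token economics for the example above (the price constant is a placeholder; substitute your provider's current rates):

# Cost comparison for the 27 × 34 example (placeholder pricing).
PRICE_PER_1K_OUT = 0.01     # assumed $/1K output tokens, NOT a real rate

standard_tokens  = 5        # direct answer only
reasoning_tokens = 80 + 5   # hidden thinking + visible answer

print(f"standard:  ${standard_tokens  / 1000 * PRICE_PER_1K_OUT:.6f}")
print(f"reasoning: ${reasoning_tokens / 1000 * PRICE_PER_1K_OUT:.6f}")
# ~17x the output tokens for the same visible answer; the payoff is
# correctness on problems the standard model would have gotten wrong.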
The Training Pipeline¶
┌─────────────────────────────────────────────────────┐
│ HOW REASONING MODELS ARE TRAINED │
│ │
│ STEP 1: Start with a pre-trained LLM │
│ (e.g., GPT-4 base, DeepSeek-V3 base) │
│ │
│ STEP 2: Supervised Fine-Tuning on reasoning traces │
│ Human-written step-by-step solutions │
│ "Here's HOW to solve this problem" │
│ │
│ STEP 3: Reinforcement Learning (the key step) │
│ ┌──────────────────────────────────────┐ │
│ │ Model generates chain-of-thought │ │
│ │ → Check if final answer is correct │ │
│ │ → Reward correct reasoning paths │ │
│ │ → Penalize wrong paths │ │
│ │ → Model learns WHICH thinking │ │
│ │ strategies lead to right answers │ │
│ └──────────────────────────────────────┘ │
│ Methods: PPO, GRPO (DeepSeek) │
│ │
│ STEP 4: Process Reward Models (PRM) │
│ Don't just check the final answer — │
│ evaluate EACH STEP of reasoning. │
│ "Step 3 was wrong" → more granular signal │
│ │
│ Result: Model that knows WHEN and HOW to think │
└─────────────────────────────────────────────────────┘
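A toy sketch of the Step 3 reward signal for verifiable-answer problems; extract_final_answer and the gold label are stand-ins for a real training harness:

# Sketch: outcome-based reward for RL on reasoning chains (Step 3).
import re

def extract_final_answer(chain: str) -> str:
    """Toy extractor: treat the last number in the chain as the answer."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", chain)
    return nums[-1] if nums else ""

def outcome_rewards(chains: list[str], gold: str) -> list[float]:
    """+1 if a chain's final answer matches the gold label, else 0.
    Reinforcing these rewards teaches WHICH thinking strategies end correctly."""
    return [1.0 if extract_final_answer(c) == gold else 0.0 for c in chains]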
Major Reasoning Models (April 2026)¶
| Model | Company | Key Feature | Access |
|---|---|---|---|
| o1 | OpenAI | First reasoning model, PhD-level science | API |
| o3 | OpenAI | Stronger reasoning, variable compute | API |
| GPT-5.4 mini | OpenAI | Cost-effective reasoning, fast (adaptive thinking) | API |
| DeepSeek-R1 | DeepSeek | Open-weight, competitive with o1 | Open |
| QwQ | Alibaba | Open reasoning model | Open |
| Gemini 3.1 Deep Think | Google | Complex technical reasoning, multimodal | API |
| Claude with extended thinking | Anthropic | Toggleable thinking mode | API |
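Of the rows above, Claude's toggleable mode is the most explicit API surface. A sketch against Anthropic's extended-thinking parameter (the model ID is a placeholder; verify names and limits in the current docs):

# Sketch: toggling Claude's extended thinking (parameter shape per
# Anthropic's docs at last verification; re-check before relying on it).
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-5",    # placeholder model ID
    max_tokens=4_000,
    thinking={"type": "enabled", "budget_tokens": 2_000},  # cap thinking spend
    messages=[{"role": "user", "content": "Prove that √2 is irrational."}],
)
for block in response.content:
    print(block.type)  # "thinking" block(s) first, then the "text" answer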
Standard vs Reasoning: When to Use What¶
| Scenario | Use Standard LLM | Use Reasoning Model |
|---|---|---|
| Simple Q&A, chat | ✅ Fast, cheap | ❌ Overkill |
| Translation, summarization | ✅ | ❌ |
| Complex math problems | ❌ Often wrong | ✅ Step-by-step verification |
| Multi-step logic/planning | ❌ | ✅ |
| Code debugging (complex) | ⚠️ Sometimes | ✅ Better at tracing issues |
| Creative writing | ✅ | ❌ Unnecessary reasoning |
| PhD-level science | ❌ | ✅ Designed for this |
| Real-time chat (low latency) | ✅ | ❌ Thinking adds latency |
| Cost-sensitive applications | ✅ | ⚠️ Thinking tokens cost money |
Test-Time Compute Techniques¶
| Technique | How It Works | Used In |
|---|---|---|
| Extended CoT | Model generates long reasoning chains | o1, o3, DeepSeek-R1 |
| Self-consistency | Generate N answers, take majority vote | Any LLM |
| Best-of-N | Generate N answers, pick best (via reward model) | Any LLM |
| Monte Carlo Tree Search | Explore reasoning paths like a chess engine | Research |
| Process Reward Models | Score each reasoning step, not just final answer | o1, o3 |
| Iterative refinement | Model critiques and improves its own answer | Claude, GPT |
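Self-consistency is the easiest of these to implement yourself. A sketch, where ask() is a stand-in for any sampling-enabled model call:

# Sketch: self-consistency via majority vote over N sampled answers.
from collections import Counter

def self_consistency(ask, prompt: str, n: int = 8) -> str:
    """ask(prompt, temperature) -> str is assumed; sample N times, take the mode."""
    answers = [ask(prompt, temperature=0.8) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]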
DeepSeek-R1: The Open-Source Breakthrough¶
WHY IT MATTERS:
- Open-weight reasoning model competitive with o1
- Showed reasoning can emerge from pure RL (R1-Zero: no supervised reasoning data!)
- "Aha moment": During training, the model spontaneously started
re-evaluating and self-correcting — emergent reasoning behavior
TRAINING APPROACH (GRPO):
Group Relative Policy Optimization
- Generate multiple solutions in parallel
- Rank them against each other (group-relative, not absolute)
- No separate reward model needed
- Simpler and cheaper than PPO
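The heart of GRPO is normalizing rewards within the sampled group; a simplified sketch (the full objective in the R1 report also includes a clipped policy ratio and a KL penalty):

# Sketch: GRPO's group-relative advantage (simplified).
def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each chain = its reward standardized within the group.
    No separate learned value or reward model is needed."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std or 1.0) for r in rewards]

# e.g. 8 sampled chains, 3 with correct final answers:
print(group_relative_advantages([1, 0, 0, 1, 0, 1, 0, 0]))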
DISTILLED VERSIONS:
DeepSeek-R1-Distill-Qwen-32B, DeepSeek-R1-Distill-Llama-70B
→ Distill R1's reasoning into smaller models
→ Available via Ollama for local use
◆ Code & Implementation¶
# ⚠️ Last tested: 2026-04
# ═══ Using OpenAI o3 ═══
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="o3",
messages=[{
"role": "user",
"content": "Prove that √2 is irrational."
}],
# Reasoning models handle CoT internally
# No need for "think step by step" prompts
# reasoning_effort="high" # low/medium/high — controls thinking depth
)
# The response includes the final answer
# Thinking tokens are consumed but hidden by default
print(response.choices[0].message.content)
print(f"Total tokens: {response.usage.total_tokens}")
# → Much higher token count due to internal reasoning
# ═══ Using DeepSeek-R1 locally via Ollama ═══
# ollama run deepseek-r1:8b
# The model outputs <think>...</think> blocks visibly
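# ═══ Parsing R1's visible thinking ═══
# Sketch against Ollama's local HTTP API (assumes the model above is
# serving on the default port; verify the endpoint in Ollama's docs).
import requests

resp = requests.post("http://localhost:11434/api/chat", json={
    "model": "deepseek-r1:8b",
    "messages": [{"role": "user", "content": "What is 27 * 34?"}],
    "stream": False,
})
content = resp.json()["message"]["content"]

# Split the <think>...</think> chain from the final answer.
thinking, _, answer = content.partition("</think>")
print("THINKING:", thinking.removeprefix("<think>").strip())
print("ANSWER:  ", answer.strip())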
◆ Quick Reference¶
REASONING MODEL DECISION TREE:
Is the task complex reasoning/math/logic?
YES → Use reasoning model (o3, DeepSeek-R1)
NO → Use standard LLM (GPT-5.4, Claude Sonnet 4.6)
Is latency critical?
YES → Use standard LLM or GPT-5.4 mini
NO → Reasoning model is fine
Is cost critical?
YES → GPT-5.4 mini or DeepSeek-R1 (open, self-host)
NO → o3 with high reasoning effort
KEY NUMBERS:
o1 on AIME 2024: 83% (vs GPT-4o: 13%)
o3 on ARC-AGI: 87.5% (vs GPT-4o: ~5%)
DeepSeek-R1 on MATH-500: 97.3%
Thinking tokens: 500-50,000+ per problem
Cost: 2-20x more than standard models per query
○ Gotchas & Common Mistakes¶
- ⚠️ Don't prompt "think step by step": Reasoning models already do this internally. Adding CoT prompts can actually hurt performance.
- ⚠️ KV cache growth & cost surprise: A single complex query can consume 50K+ thinking tokens. Because the model autoregressively generates and attends to every hidden token, the KV cache balloons during long chains, straining GPU memory in serving stacks (hence paged-attention-style cache management) and inflating API bills. Monitor token economics ruthlessly; a guardrail sketch follows this list.
- ⚠️ Latency: Thinking takes time. A hard math problem might take 30-60 seconds. Not suitable for real-time chat.
- ⚠️ Not always better: For simple tasks, reasoning models waste compute and can overthink. Use standard models for simple tasks.
- ⚠️ Hidden thinking ≠ explainable: You see the answer but not always the reasoning (o1/o3 hide thinking by default).
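A sketch of the guardrail the KV-cache gotcha calls for: cap thinking with max_completion_tokens and log reasoning-token spend (field names follow OpenAI's o-series usage reporting at last verification; confirm against the current SDK):

# Sketch: budget cap + reasoning-token telemetry for o-series calls.
from openai import OpenAI

client = OpenAI()

def guarded_reasoning_call(prompt: str, budget: int = 8_000) -> str:
    """Hard-cap thinking + answer tokens and log how much went to reasoning."""
    response = client.chat.completions.create(
        model="o3",
        messages=[{"role": "user", "content": prompt}],
        max_completion_tokens=budget,   # ceiling on thinking + visible answer
    )
    usage = response.usage
    details = getattr(usage, "completion_tokens_details", None)
    reasoning = getattr(details, "reasoning_tokens", None)
    print(f"completion={usage.completion_tokens}, reasoning={reasoning}")
    return response.choices[0].message.content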
○ Interview Angles¶
- Q: What is test-time compute scaling and why does it matter?
- A: Instead of scaling model size (pre-training compute), you scale compute at inference — let the model "think longer" on harder problems. This is more efficient because you allocate compute per problem (easy = cheap, hard = expensive) rather than baking it all into a massive model. o1/o3 showed this can match or exceed much larger standard models.
- Q: How is DeepSeek-R1 trained?
- A: With GRPO (Group Relative Policy Optimization): generate multiple reasoning chains for a problem, rank them relative to the group, and reinforce the better paths. Remarkably, reasoning behaviors (self-correction, re-evaluation) emerged from RL alone, without supervised reasoning data.
- Q: When would you NOT use a reasoning model?
- A: Simple tasks (chat, translation, summarization), latency-critical applications (real-time), cost-sensitive high-volume scenarios, and creative tasks where "thinking" adds no value. Reasoning models are for problems where correct step-by-step logic matters.
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | Llms Overview, Prompt Engineering (CoT), Deep Learning Fundamentals (RL) |
| Leads to | Ethics Safety Alignment (alignment via RL), Inference Optimization (serving reasoning models) |
| Compare with | Standard LLMs (direct generation), Ai Agents (multi-step but external planning) |
| Cross-domain | Formal verification, Theorem proving, Game AI (MCTS) |
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Thinking token explosion | Reasoning model uses 10K+ thinking tokens on simple questions | No complexity routing, all queries sent to reasoning model | Classify complexity first, route simple queries to base model |
| False reasoning chains | Plausible-sounding but logically invalid reasoning | Chain-of-thought doesn't guarantee logical validity | Verification step, structured output for checkable steps |
| Latency unacceptable | 15-30 second response times | Test-time compute inherently slow | Streaming partial results, speculative execution, caching |
| Cost 10x vs base model | Reasoning model costs overwhelm budget | Using o1-level model for all queries | Tiered model selection, reasoning only for complex queries |
◆ Hands-On Exercises¶
Exercise 1: Build a Complexity Router¶
Goal: Route queries to reasoning vs non-reasoning models based on complexity
Time: 30 minutes
Steps:
1. Create 20 queries spanning simple factual to complex multi-step reasoning
2. Build a lightweight classifier (LLM-as-judge or heuristic) for complexity
3. Route simple queries to GPT-4o-mini and complex queries to o3-mini
4. Compare cost vs quality with and without routing
Expected Output: 50-70% cost reduction with <5% quality drop on simple queries
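A starting-point sketch for Steps 2-3; the keyword heuristic is a deliberate toy, and the model IDs mirror the exercise:

# Sketch: heuristic complexity router (replace with LLM-as-judge for Step 2).
from openai import OpenAI

client = OpenAI()
HARD_MARKERS = ("prove", "derive", "debug", "plan", "optimize", "step by step")

def route(query: str) -> str:
    """Hard-looking queries go to the reasoning model, the rest to the base model."""
    return "o3-mini" if any(m in query.lower() for m in HARD_MARKERS) else "gpt-4o-mini"

def answer(query: str) -> str:
    model = route(query)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
    )
    return f"[{model}] {response.choices[0].message.content}"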
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 📄 Paper | Wei et al. "Chain-of-Thought Prompting" (2022) | Foundational paper on reasoning in LLMs |
| 📄 Paper | DeepSeek-R1 Technical Report (2025) | How GRPO enables reasoning model training |
| 📘 Book | "AI Engineering" by Chip Huyen (2025), Ch 5 | Covers reasoning techniques and their production implications |
| 🎥 Video | Andrej Karpathy — "Deep Dive into o1" | Analysis of reasoning model architectures |
★ Sources¶
- OpenAI, "Learning to Reason with LLMs" (o1 blog post, Sep 2024)
- DeepSeek, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (Jan 2025)
- Snell et al., "Scaling LLM Test-Time Compute" (2024)
- OpenAI o3 release announcement (2025), GPT-5.4 mini (March 2026)