
Scaling Laws & Pre-training

Bit: GPT-5.4 cost hundreds of millions of dollars to train. Not because the algorithm is complex — it's literally next-token prediction — but because you need ~25,000 GPUs running for months on trillions of tokens. The secret of LLMs is embarrassingly simple: scale.


★ TL;DR

  • What: The process of training an LLM from scratch on internet-scale data, and the mathematical laws predicting how performance improves with more compute, data, and parameters
  • Why: Understanding pre-training explains WHY bigger models are better, HOW training costs scale, and WHEN to stop training — critical for anyone building or evaluating LLMs
  • Key point: Chinchilla showed training a SMALLER model on MORE data beats a bigger model on less data. This insight reshaped the entire industry.

★ Overview

Definition

Scaling laws describe how model performance changes as compute, data, and parameter count increase. Pre-training is the large-scale learning phase where a model absorbs general patterns before later alignment or specialization.

Scope

This note focuses on the economics, mechanics, and trade-offs of pre-training at scale. For the distributed systems layer behind these runs, see Distributed Training for Large Models.

Significance

  • Scaling behavior explains why frontier labs invest so heavily in data, compute, and optimization.
  • Pre-training remains the foundation beneath later techniques such as fine-tuning, RL alignment, and distillation.

★ Deep Dive

The Pre-training Pipeline

┌──────────────────────────────────────────────────────┐
│         HOW AN LLM IS ACTUALLY TRAINED                │
│                                                      │
│  1. DATA COLLECTION                                  │
│     Crawl the internet: CommonCrawl, Wikipedia,      │
│     books, code (GitHub), research papers, forums    │
│     Scale: 10-15 TRILLION tokens typical (2025-2026) │
│                                                      │
│  2. DATA CLEANING & FILTERING                        │
│     Deduplication (exact + fuzzy matching)            │
│     Quality filtering (classifier-based)             │
│     Toxicity/PII removal                             │
│     Language identification and balancing             │
│     Cost: Months of engineering, underrated          │
│                                                      │
│  3. TOKENIZATION                                     │
│     BPE tokenizer trained on the data                │
│     Vocabulary: 32K-256K tokens                      │
│     See: ../foundations/tokenization.md               │
│                                                      │
│  4. DATA MIX RATIOS (secret sauce)                   │
│     ┌───────────────────────────────────┐            │
│     │ Web text:      ~50-60%            │            │
│     │ Code (GitHub): ~15-25%            │            │
│     │ Books:         ~5-10%             │            │
│     │ Scientific:    ~5-10%             │            │
│     │ Math:          ~3-5%              │            │
│     │ Multilingual:  ~10-20%            │            │
│     │ Conversation:  ~3-5%              │            │
│     └───────────────────────────────────┘            │
│     These ratios MASSIVELY affect capabilities       │
│     More code → better reasoning (!)                 │
│                                                      │
│  5. TRAINING                                         │
│     Objective: Predict the next token                │
│     Hardware: 10K-100K GPUs (H100/H200/B200)         │
│     Duration: 2-6 months                             │
│     Cost: $50M-$500M+ per training run               │
│     Infrastructure: NVIDIA NVLink, InfiniBand,       │
│       distributed training (FSDP, DeepSpeed, Megatron)│
│                                                      │
│  6. MONITORING                                       │
│     Track: loss curves, learning rate, gradient norms │
│     Handle: loss spikes (restart from checkpoint)    │
│     Checkpoint every N steps (recover from crashes)  │
│     Evaluate on held-out benchmarks periodically     │
└──────────────────────────────────────────────────────┘
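
Step 5 is, mechanically, nothing more than cross-entropy on shifted tokens. Below is a minimal sketch of one training step, assuming PyTorch; `model` here stands for any module that maps token IDs to per-position vocabulary logits, not a specific library class.

# Minimal next-token prediction step (sketch): assumes PyTorch; `model` is any
# module mapping token IDs (batch, seq_len-1) to logits (batch, seq_len-1, vocab).
import torch
import torch.nn.functional as F

def pretraining_step(model, batch: torch.Tensor, optimizer) -> float:
    inputs  = batch[:, :-1]       # context: tokens 1 .. T-1
    targets = batch[:, 1:]        # label at position t is token t+1
    logits  = model(inputs)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten to (batch * positions, vocab)
        targets.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In a real run this sits inside the distributed setup of step 5 (FSDP, DeepSpeed, or Megatron), with mixed precision, learning-rate scheduling, and gradient accumulation wrapped around it.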

Scaling Laws

THE CORE INSIGHT (Kaplan et al., 2020):

  Model performance (loss) improves as a POWER LAW with:
    1. Number of parameters (N)
    2. Amount of training data (D)
    3. Amount of compute (C)

  L(C) ∝ C^(-0.05)  (loss decreases with compute)
  L(N) ∝ N^(-0.076) (loss decreases with parameters)
  L(D) ∝ D^(-0.095) (loss decreases with data)

  WHAT THIS MEANS:
  - 10x more compute → predictable improvement
  - Returns diminish but NEVER stop (no plateau found yet)
  - You can PREDICT a model's quality before training it
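
To make the compute power law concrete, here is a stdlib-only sketch that plugs scale factors into L ∝ C^(-0.05), using the exponent quoted above; only the ratio matters, not any absolute loss value:

# Scaling compute by k multiplies loss by k^(-0.05) (Kaplan et al. exponent).
def loss_ratio(scale_factor: float, exponent: float = 0.05) -> float:
    return scale_factor ** (-exponent)

for k in (2, 10, 100, 1000):
    r = loss_ratio(k)
    print(f"{k:>5}x compute -> loss x {r:.3f}  ({(1 - r) * 100:4.1f}% reduction)")
# 10x compute -> loss x 0.891 (~11% reduction): steady and predictable, but diminishing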

Chinchilla Scaling (DeepMind, 2022)

THE GAME-CHANGER:

  OpenAI's approach (2020-2022): "Make models BIGGER"
    GPT-3: 175B params, trained on 300B tokens
    Bigger model, less data → expensive inference

  DeepMind's Chinchilla finding:
    "For a given compute budget, you should train a
     SMALLER model on MORE data"

  THE RULE:
    Optimal tokens ≈ 20 × parameters

    Model size    │ Optimal data │ GPT-3 used │ Chinchilla-optimal?
    ──────────────┼──────────────┼────────────┼──────────────────────────
    10B params    │ 200B tokens  │ n/a        │ ✓
    70B params    │ 1.4T tokens  │ 300B (!)   │ ✗ undertrained
    175B params   │ 3.5T tokens  │ 300B (!)   │ ✗ MASSIVELY undertrained

  IMPACT:
    GPT-3 was 10x undertrained by this rule!
    LLaMA (Meta, 2023): 65B model trained on 1.4T tokens
      → Matched GPT-3 175B with 3x fewer parameters!

  POST-CHINCHILLA (2024-2026):
    Industry shifted to "over-training" small models:
    Train way beyond the Chinchilla-optimal point
    because inference cost matters more than training cost.

    LLaMA 3 8B: trained on 15T tokens (1875× params!)
    Reason: Train once (expensive), run forever (cheap)
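
Combining the 20-tokens-per-parameter rule with the standard C ≈ 6·N·D FLOPs approximation (used again in the code section below) lets you go from a compute budget straight to a compute-optimal model size. A stdlib-only sketch:

# From a FLOPs budget to a compute-optimal model: C ≈ 6*N*D and D ≈ 20*N
# imply C ≈ 120*N^2, so N = sqrt(C / 120).
import math

def compute_optimal(flops_budget: float) -> tuple[float, float]:
    params = math.sqrt(flops_budget / 120)
    tokens = 20 * params
    return params, tokens

for budget in (1e21, 1e22, 1e23, 1e24):
    n, d = compute_optimal(budget)
    print(f"{budget:.0e} FLOPs -> {n / 1e9:6.1f}B params, {d / 1e9:7.0f}B tokens")

Sanity check: Chinchilla's own 70B / 1.4T-token run corresponds to roughly 6 × 70e9 × 1.4e12 ≈ 6e23 FLOPs, which this formula maps back to ~70B parameters.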

Training Infrastructure

HARDWARE (2025-2026 training runs):

  GPU: NVIDIA H100 (80GB) → H200 (141GB) → B200/GB300

  Typical cluster:
    GPT-5.x training:    ~25,000+ H100s
    LLaMA 4:             ~16,000 H100s
    Gemini 3.x:          TPU v5p pods (~10,000+ chips)
    DeepSeek-R1:         ~2,000 H100s (cost-efficient!)

  PARALLELISM STRATEGIES:
    Data Parallel:     Same model on each GPU, different data
    Tensor Parallel:   Split layers WITHIN GPUs (same node)
    Pipeline Parallel: Split layers ACROSS GPUs (different nodes)
    Expert Parallel:   MoE experts on different GPUs
    FSDP:             Fully Sharded Data Parallel (PyTorch)

  NETWORKING:
    NVLink:      GPU-to-GPU within node (900 GB/s)
    InfiniBand:  Node-to-node (400 Gb/s per port)
    Critical: Training large models is often NETWORK-bound

  COST EXAMPLES (approximate, 2025-2026):
    LLaMA 3 70B:     ~$10M
    GPT-5 family:    ~$200M-$500M+
    Gemini 3.x:      ~$100M+ (TPU costs differ)
    DeepSeek-R1:     ~$5M (remarkably cost-efficient)
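
Those dollar figures follow from simple arithmetic once you fix a few assumptions. The sketch below assumes ~1e15 FLOP/s per GPU (roughly H100 BF16 peak), 40% utilization (MFU), and $2 per GPU-hour; these are illustrative assumptions, not quoted prices.

# Back-of-the-envelope training cost. ASSUMED values: peak_flops, mfu, usd_per_gpu_hour.
def training_cost(params: float, tokens: float, n_gpus: int = 16_000,
                  peak_flops: float = 1e15, mfu: float = 0.40,
                  usd_per_gpu_hour: float = 2.0) -> None:
    total_flops = 6 * params * tokens                      # C ≈ 6·N·D
    seconds = total_flops / (n_gpus * peak_flops * mfu)    # wall-clock time
    gpu_hours = n_gpus * seconds / 3600
    print(f"{total_flops:.1e} FLOPs | {seconds / 86400:.1f} days on {n_gpus} GPUs "
          f"| ~${gpu_hours * usd_per_gpu_hour / 1e6:.0f}M in GPU time")

training_cost(params=70e9, tokens=15e12)   # a LLaMA-3-70B-scale run

With these assumptions, a 70B model on 15T tokens lands near $9M of GPU time, the same ballpark as the ~$10M figure above; GPU time is only part of the full bill (see the cost breakdown in Quick Reference).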

Training Challenges

COMMON FAILURES:
  1. Loss spikes:    Sudden jumps in loss, often from bad data
                     Fix: restart from last checkpoint, skip bad batch

  2. Gradient issues: NaN gradients, exploding/vanishing
                     Fix: gradient clipping, learning rate warmup

  3. Hardware failures: GPUs die during months-long training
                     Fix: aggressive checkpointing, auto-restart

  4. Data contamination: Benchmark data leaks into training set
                     Fix: careful deduplication, held-out evaluation

  5. Instability at scale: Training becomes chaotic at 100B+ params
                     Fix: bf16 precision, μP (maximal update parametrization)
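
Failures 1 and 2 are typically handled with a few lines inside the training loop. A sketch assuming PyTorch; the spike threshold, EMA factor, and skip-batch policy are illustrative choices, not a standard recipe:

# Guarding a training step against loss spikes and exploding gradients (sketch).
import torch

GRAD_CLIP_NORM = 1.0   # typical values are around 0.5-1.0 (assumed here)
SPIKE_FACTOR   = 3.0   # flag a spike if loss jumps to >3x the running average (assumed)

def guarded_step(model, loss, optimizer, running_loss):
    if running_loss is not None and loss.item() > SPIKE_FACTOR * running_loss:
        optimizer.zero_grad()
        return running_loss, True   # spike: skip batch; caller may rewind to a checkpoint
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP_NORM)  # challenge #2
    optimizer.step()
    optimizer.zero_grad()
    ema = loss.item() if running_loss is None else 0.99 * running_loss + 0.01 * loss.item()
    return ema, False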

◆ Quick Reference

SCALING RULES OF THUMB:
  Chinchilla:     tokens ≈ 20× parameters (compute-optimal)
  Over-training:  tokens ≈ 100-2000× params (inference-optimal)
  10× compute:    ~10% lower loss (reliable; 10^-0.05 ≈ 0.89)

TRAINING COST COMPONENTS:
  GPU hours:       60-80% of total cost
  Networking:      10-15%
  Storage:         5-10%
  Engineering:     10-15% (often underestimated)

PRE-TRAINING OBJECTIVE:
  Next-token prediction: P(token_t | token_1, ..., token_{t-1})
  That's it. This single objective produces all LLM capabilities.

○ Interview Angles

  • Q: Explain the Chinchilla scaling laws.
  • A: For a fixed compute budget, there's an optimal ratio of model size to training data. Chinchilla showed the optimal is ~20 tokens per parameter. GPT-3 (175B params, 300B tokens) was massively undertrained — a 70B model on 1.4T tokens would match it. This led to LLaMA's approach: smaller models, much more data. In 2025-2026, industry "over-trains" beyond Chinchilla-optimal because inference cost (running the model) matters more than training cost (one-time).

  • Q: How is a large language model pre-trained?

  • A: (1) Collect trillions of tokens from the web, books, and code. (2) Clean and deduplicate aggressively. (3) Train a BPE tokenizer. (4) Set data mix ratios (web, code, books, math). (5) Train with next-token prediction on 10K-100K GPUs for 2-6 months using distributed parallelism (data, tensor, pipeline). (6) Monitor loss curves, handle spikes, checkpoint regularly. Cost: $10M-$500M+ per run.

★ Code & Implementation

Chinchilla Optimal Token Calculator

# ⚠️ Last tested: 2026-04 | Requires: Python 3.10+ (stdlib only)
# Chinchilla paper (Hoffmann et al. 2022): optimal training = 20 tokens per parameter

def chinchilla_optimal(params: float) -> dict:
    """
    params: model parameters (e.g. 7e9 for 7B)
    Returns: Chinchilla-optimal token count and FLOPs estimate
    """
    tokens_optimal = 20 * params          # Chinchilla rule
    flops_estimate = 6 * params * tokens_optimal  # ~6 * N * D for transformer training
    return {
        "params":         params,
        "params_B":       params / 1e9,
        "tokens_optimal": tokens_optimal,
        "tokens_B":       tokens_optimal / 1e9,
        "flops_estimate": flops_estimate,
        "flops_e21":      flops_estimate / 1e21,
    }

# Compare common model sizes
for model_name, params in [
    ("LLaMA 3.2 1B",  1e9),
    ("LLaMA 3.2 8B",  8e9),
    ("LLaMA 3 70B",  70e9),
    ("GPT-3 175B",  175e9),
]:
    r = chinchilla_optimal(params)
    print(
        f"{model_name:<18} | {r['params_B']:>6.1f}B params | "
        f"optimal: {r['tokens_B']:>6.0f}B tokens | "
        f"{r['flops_e21']:.1f}e21 FLOPs"
    )

# Note: Modern models (LLaMA 3, Gemma 3) over-train by roughly 10-100x beyond the
# Chinchilla-optimal token count for better inference efficiency; Chinchilla is the
# floor, not the ceiling.

★ Connections

Relationship   │ Topics
───────────────┼──────────────────────────────────────────────────
Builds on      │ Transformers, Deep Learning Fundamentals
Leads to       │ LLMs Overview, Fine-tuning (SFT stage)
Compare with   │ Fine-tuning (adaptation), Few-shot (no training)
Cross-domain   │ Distributed systems, HPC, Data engineering

◆ Production Failure Modes

Compute budget misallocation
  Symptoms:   Over-parameterized model with insufficient data (or vice versa)
  Root cause: Ignoring Chinchilla scaling laws (~20 tokens per parameter)
  Mitigation: Use Chinchilla-optimal ratios when allocating compute

Data quality plateau
  Symptoms:   Loss stops decreasing despite more compute
  Root cause: Training data contains duplicates, noise, low-quality content
  Mitigation: Deduplicate, filter by perplexity, quality-score the data

Emergent capability surprises
  Symptoms:   Capabilities appear or disappear at unexpected scale
  Root cause: Phase transitions in model behavior
  Mitigation: Benchmark at multiple scales; don't extrapolate from small models

◆ Hands-On Exercises

Exercise 1: Plot Your Own Scaling Law

Goal: Train models at 3 different scales and verify power-law behavior
Time: 45 minutes
Steps:
  1. Train a small transformer at 1M, 10M, and 100M parameters on the same dataset
  2. Log validation loss at each scale
  3. Plot loss vs. compute on a log-log scale
  4. Fit a power law and extrapolate (see the sketch below)
Expected output: A log-log plot showing an approximately linear scaling-law relationship
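
For step 4, fitting L = a·C^(-b) is just a linear least-squares fit in log-log space. A stdlib-only sketch with placeholder data points (swap in your own logged (compute, loss) pairs):

# Fit L = a * C^(-b) via least squares on log-transformed values.
import math

points = [(1e15, 4.2), (1e16, 3.7), (1e17, 3.3)]   # placeholder (compute, loss) pairs
xs = [math.log(c) for c, _ in points]
ys = [math.log(l) for _, l in points]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
a, b = math.exp(my - slope * mx), -slope

print(f"fitted law: L(C) ≈ {a:.2f} * C^(-{b:.3f})")
c_new = 1e18   # extrapolate one order of magnitude past the largest run
print(f"predicted loss at {c_new:.0e} FLOPs: {a * c_new ** (-b):.2f}")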


◆ Resources

  📄 Paper: Kaplan et al., "Scaling Laws for Neural Language Models" (2020)
     Why: Original OpenAI scaling laws for compute, data, and parameters
  📄 Paper: Hoffmann et al., "Chinchilla" (2022)
     Why: Revised scaling: compute-optimal training needs more data than expected
  🎥 Video: Andrej Karpathy, "Let's Build GPT"
     Why: Build a language model from scratch; pre-training intuition
  📘 Book: "AI Engineering" by Chip Huyen (2025), Ch. 2
     Why: Practical understanding of model selection and scaling trade-offs

★ Sources

  • Kaplan et al., "Scaling Laws for Neural Language Models" (2020)
  • Hoffmann et al., "Training Compute-Optimal Large Language Models" (Chinchilla, 2022)
  • Touvron et al., "LLaMA: Open and Efficient Foundation Language Models" (2023)
  • Meta, "LLaMA 3 Technical Report" (2024)
  • NVIDIA, "Megatron-LM: Training Multi-Billion Parameter Language Models" (2020)