Scaling Laws & Pre-training¶
✨ Bit: GPT-5.4 cost hundreds of millions of dollars to train. Not because the algorithm is complex — it's literally next-token prediction — but because you need ~25,000 GPUs running for months on trillions of tokens. The secret of LLMs is embarrassingly simple: scale.
★ TL;DR¶
- What: The process of training an LLM from scratch on internet-scale data, and the mathematical laws predicting how performance improves with more compute, data, and parameters
- Why: Understanding pre-training explains WHY bigger models are better, HOW training costs scale, and WHEN to stop training — critical for anyone building or evaluating LLMs
- Key point: Chinchilla showed that, for a fixed compute budget, training a SMALLER model on MORE data beats a bigger model on less data. This insight reshaped the entire industry.
★ Overview¶
Definition¶
Scaling laws describe how model performance changes as compute, data, and parameter count increase. Pre-training is the large-scale learning phase where a model absorbs general patterns before later alignment or specialization.
Scope¶
This note focuses on the economics, mechanics, and trade-offs of pre-training at scale. For the distributed systems layer behind these runs, see Distributed Training for Large Models.
Significance¶
- Scaling behavior explains why frontier labs invest so heavily in data, compute, and optimization.
- Pre-training remains the foundation beneath later techniques such as fine-tuning, RL alignment, and distillation.
★ Deep Dive¶
The Pre-training Pipeline¶
┌──────────────────────────────────────────────────────┐
│ HOW AN LLM IS ACTUALLY TRAINED │
│ │
│ 1. DATA COLLECTION │
│ Crawl the internet: CommonCrawl, Wikipedia, │
│ books, code (GitHub), research papers, forums │
│ Scale: 10-15 TRILLION tokens typical (2025-2026) │
│ │
│ 2. DATA CLEANING & FILTERING │
│ Deduplication (exact + fuzzy matching) │
│ Quality filtering (classifier-based) │
│ Toxicity/PII removal │
│ Language identification and balancing │
│ Cost: Months of engineering, underrated │
│ │
│ 3. TOKENIZATION │
│ BPE tokenizer trained on the data │
│ Vocabulary: 32K-256K tokens │
│ See: ../foundations/tokenization.md │
│ │
│ 4. DATA MIX RATIOS (secret sauce) │
│ ┌───────────────────────────────────┐ │
│ │ Web text: ~50-60% │ │
│ │ Code (GitHub): ~15-25% │ │
│ │ Books: ~5-10% │ │
│ │ Scientific: ~5-10% │ │
│ │ Math: ~3-5% │ │
│ │ Multilingual: ~10-20% │ │
│ │ Conversation: ~3-5% │ │
│ └───────────────────────────────────┘ │
│ These ratios MASSIVELY affect capabilities │
│ More code → better reasoning (!) │
│ │
│ 5. TRAINING │
│ Objective: Predict the next token │
│ Hardware: 10K-100K GPUs (H100/H200/B200) │
│ Duration: 2-6 months │
│ Cost: $50M-$500M+ per training run │
│ Infrastructure: NVIDIA NVLink, InfiniBand, │
│ distributed training (FSDP, DeepSpeed, Megatron)│
│ │
│ 6. MONITORING │
│ Track: loss curves, learning rate, gradient norms │
│ Handle: loss spikes (restart from checkpoint) │
│ Checkpoint every N steps (recover from crashes) │
│ Evaluate on held-out benchmarks periodically │
└──────────────────────────────────────────────────────┘
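Step 2 is the most underrated stage, so here is a minimal sketch of its exact-deduplication pass, stdlib only. Real pipelines layer fuzzy matching (MinHash/LSH) and learned quality filters on top of this; the `normalize` helper is a deliberately crude stand-in.

# ⚠️ Illustrative sketch: exact dedup only; production adds fuzzy dedup + quality filters
import hashlib

def normalize(text: str) -> str:
    # Crude normalization: lowercase + collapse whitespace so trivial
    # variants of the same document hash identically
    return " ".join(text.lower().split())

def exact_dedup(docs: list[str]) -> list[str]:
    seen: set[str] = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

print(exact_dedup(["Hello   World", "hello world", "something else"]))
# -> ['Hello   World', 'something else']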
Scaling Laws¶
THE CORE INSIGHT (Kaplan et al., 2020):
Model performance (loss) improves as a POWER LAW with:
1. Number of parameters (N)
2. Amount of training data (D)
3. Amount of compute (C)
L(C) ∝ C^(-0.05) (loss decreases with compute)
L(N) ∝ N^(-0.076) (loss decreases with parameters)
L(D) ∝ D^(-0.095) (loss decreases with data)
WHAT THIS MEANS:
- 10x more compute → predictable improvement
- Returns diminish but NEVER stop (no plateau found yet)
- You can PREDICT a model's quality before training it
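To make that prediction concrete, here is a tiny calculator using the L(C) ∝ C^(-0.05) relation quoted above. The exponent is Kaplan et al.'s fitted value for their setup, not a universal constant, so treat this as a heuristic:

# ⚠️ Back-of-the-envelope using the fitted Kaplan compute exponent
def loss_factor(compute_multiplier: float, alpha: float = 0.05) -> float:
    """Multiplicative change in loss when compute grows by `compute_multiplier`."""
    return compute_multiplier ** (-alpha)

for mult in (10, 100, 1000):
    f = loss_factor(mult)
    print(f"{mult:>5}x compute -> loss x {f:.3f} (~{(1 - f) * 100:.0f}% lower)")
# 10x -> ~11% lower, 100x -> ~21%, 1000x -> ~29%: diminishing but nonzero returns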
Chinchilla Scaling (DeepMind, 2022)¶
THE GAME-CHANGER:
OpenAI's approach (2020-2022): "Make models BIGGER"
GPT-3: 175B params, trained on 300B tokens
Bigger model, less data → expensive inference
DeepMind's Chinchilla finding:
"For a given compute budget, you should train a
SMALLER model on MORE data"
THE RULE:
Optimal tokens ≈ 20 × parameters
Model Size  │ Optimal Data │ GPT-3 Used │ Chinchilla-optimal?
────────────┼──────────────┼────────────┼─────────────────────────
10B params  │ 200B tokens  │ (N/A)      │ ✓
70B params  │ 1.4T tokens  │ 300B (!)   │ ✗ undertrained
175B params │ 3.5T tokens  │ 300B (!)   │ ✗ MASSIVELY undertrained
IMPACT:
GPT-3 was 10x undertrained by this rule!
LLaMA (Meta, 2023): 65B model trained on 1.4T tokens
→ Matched GPT-3 175B with 3x fewer parameters!
POST-CHINCHILLA (2024-2026):
Industry shifted to "over-training" small models:
Train way beyond the Chinchilla-optimal point
because inference cost matters more than training cost.
LLaMA 3 8B: trained on 15T tokens (1875× params!)
Reason: Train once (expensive), run forever (cheap)
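The "train once, run forever" argument is simple arithmetic. A toy comparison, assuming the standard ~6·N·D FLOPs estimate for training and ~2·N FLOPs per generated token at inference; the lifetime serving volume below is an invented illustration, and quality parity between the two models is assumed:

# ⚠️ Toy arithmetic; serving volume and quality parity are assumptions
def lifetime_flops(params: float, train_tokens: float, served_tokens: float) -> float:
    train = 6 * params * train_tokens    # standard transformer training estimate
    serve = 2 * params * served_tokens   # ~2*N FLOPs per generated token
    return train + serve

SERVED = 1e15  # hypothetical lifetime serving volume: 1 quadrillion tokens

chinchilla_70b = lifetime_flops(70e9, 1.4e12, SERVED)  # Chinchilla-optimal 70B
overtrained_8b = lifetime_flops(8e9, 15e12, SERVED)    # LLaMA-3-style 8B

print(f"70B Chinchilla-optimal lifetime: {chinchilla_70b:.2e} FLOPs")
print(f"8B over-trained lifetime:        {overtrained_8b:.2e} FLOPs")
# If the over-trained 8B matches quality on your tasks, it is ~8x cheaper overall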
Training Infrastructure¶
HARDWARE (2025-2026 training runs):
GPU: NVIDIA H100 (80GB) → H200 (141GB) → B200/GB300
Typical cluster:
GPT-5.x training: ~25,000+ H100s
LLaMA 4: ~16,000 H100s
Gemini 3.x: TPU v5p pods (~10,000+ chips)
DeepSeek-R1: ~2,000 H100s (cost-efficient!)
PARALLELISM STRATEGIES:
Data Parallel: Same model on each GPU, different data
Tensor Parallel: Split individual layers ACROSS GPUs (within a node)
Pipeline Parallel: Split the layer stack into stages ACROSS nodes
Expert Parallel: MoE experts on different GPUs
FSDP: Fully Sharded Data Parallel (PyTorch)
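A minimal FSDP sketch to show the shape of the API, assuming PyTorch >= 2.0 and at least 2 GPUs; a toy MLP stands in for a transformer stack, and the loss is a dummy, not next-token cross-entropy:

# ⚠️ Minimal FSDP sketch, assuming PyTorch >= 2.0 and >=2 GPUs
# Launch: torchrun --nproc_per_node=2 fsdp_demo.py
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")               # one process per GPU
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Toy MLP standing in for a transformer block stack
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 1024),
    ).cuda()

    # FSDP shards parameters, gradients, and optimizer state across ranks,
    # materializing full parameters only for each unit's forward/backward
    model = FSDP(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).pow(2).mean()                 # dummy loss for demonstration
    loss.backward()
    optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()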
NETWORKING:
NVLink: GPU-to-GPU within node (900 GB/s)
InfiniBand: Node-to-node (400 Gb/s per port)
Critical: Training large models is often NETWORK-bound
COST EXAMPLES (approximate, 2025-2026):
LLaMA 3 70B: ~$10M
GPT-5 family: ~$200M-$500M+
Gemini 3.x: ~$100M+ (TPU costs differ)
DeepSeek-R1: ~$5M (remarkably cost-efficient)
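These estimates reduce to simple GPU-hour arithmetic. A napkin-math sketch; the $/GPU-hour rate is an assumed blended figure, not a quoted price, and varies widely between owned clusters and cloud rentals:

# ⚠️ Napkin math only; $/GPU-hour is an assumed blended rate, not a quote
def training_cost_usd(num_gpus: int, days: float, usd_per_gpu_hour: float = 2.50) -> float:
    return num_gpus * days * 24 * usd_per_gpu_hour

print(f"16,000 GPUs x 90 days: ${training_cost_usd(16_000, 90):>13,.0f}")
print(f" 2,000 GPUs x 60 days: ${training_cost_usd(2_000, 60):>13,.0f}")
# GPU time is only 60-80% of the total bill (see Quick Reference below)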
Training Challenges¶
COMMON FAILURES:
1. Loss spikes: Sudden jumps in loss, often from bad data
Fix: restart from last checkpoint, skip bad batch
2. Gradient issues: NaN gradients, exploding/vanishing
Fix: gradient clipping, learning rate warmup
3. Hardware failures: GPUs die during months-long training
Fix: aggressive checkpointing, auto-restart
4. Data contamination: Benchmark data leaks into training set
Fix: careful deduplication, held-out evaluation
5. Instability at scale: Training becomes chaotic at 100B+ params
Fix: bf16 precision, μP (maximal update parametrization)
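Fixes 1-3 compress into a few lines of a PyTorch-style training loop. A sketch, assuming `model`, `optimizer`, and a computed `loss` already exist; this is a guard function, not a full trainer:

# ⚠️ Sketch of mitigations 1-3 only; assumes model/optimizer/loss exist
import math
import torch

def guarded_step(model, optimizer, loss, step, clip_norm=1.0, ckpt_every=1000):
    if not math.isfinite(loss.item()):    # (1) loss spike / NaN: skip the bad batch
        optimizer.zero_grad(set_to_none=True)
        return False
    loss.backward()
    # (2) gradient clipping keeps exploding gradients from derailing the run
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    if step % ckpt_every == 0:            # (3) checkpoint often; crashes WILL happen
        torch.save({"model": model.state_dict(), "step": step}, f"ckpt_{step:07d}.pt")
    return True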
◆ Quick Reference¶
SCALING RULES OF THUMB:
Chinchilla: tokens ≈ 20× parameters (compute-optimal)
Over-training: tokens ≈ 100-2000× params (inference-optimal)
10× compute: ~11% lower loss (10^-0.05 ≈ 0.89; small but reliable)
TRAINING COST COMPONENTS:
GPU hours: 60-80% of total cost
Networking: 10-15%
Storage: 5-10%
Engineering: 10-15% (often underestimated)
PRE-TRAINING OBJECTIVE:
Next-token prediction: P(token_t | token_1, ..., token_{t-1})
That's it. This single objective produces all LLM capabilities.
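That objective in code, as a PyTorch sketch; random tensors stand in for a real model's logits:

# ⚠️ Sketch of the objective only; random tensors stand in for a real model
import math
import torch
import torch.nn.functional as F

vocab, seq_len, batch = 50_000, 128, 4
logits = torch.randn(batch, seq_len, vocab)         # model outputs (stand-in)
tokens = torch.randint(0, vocab, (batch, seq_len))  # input token ids

# Position t predicts token t+1, so shift predictions and targets by one
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab),  # predictions for positions 0..T-2
    tokens[:, 1:].reshape(-1),          # targets are tokens 1..T-1
)
print(f"loss ≈ {loss.item():.2f}; ln(vocab) ≈ {math.log(vocab):.2f} for random logits")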
○ Interview Angles¶
- Q: Explain the Chinchilla scaling laws.
- A: For a fixed compute budget, there's an optimal ratio of model size to training data. Chinchilla showed the optimal is ~20 tokens per parameter. GPT-3 (175B params, 300B tokens) was massively undertrained; a 70B model on 1.4T tokens would match it. This led to LLaMA's approach: smaller models, much more data. In 2025-2026, industry "over-trains" beyond Chinchilla-optimal because inference cost (running the model) matters more than training cost (one-time).
- Q: How is a large language model pre-trained?
- A: (1) Collect trillions of tokens from internet, books, code. (2) Clean and deduplicate aggressively. (3) Train a BPE tokenizer. (4) Set data mix ratios (web, code, books, math). (5) Train using next-token prediction on 10K-100K GPUs for 2-6 months using distributed parallelism (data, tensor, pipeline). (6) Monitor loss curves, handle spikes, checkpoint regularly. Cost: $10M-$500M+ per run.
★ Code & Implementation¶
Chinchilla Optimal Token Calculator¶
# ⚠️ Last tested: 2026-04 | Requires: Python 3.10+ (stdlib only)
# Chinchilla paper (Hoffmann et al. 2022): optimal training ≈ 20 tokens per parameter
def chinchilla_optimal(params: float) -> dict:
    """
    params: model parameters (e.g. 7e9 for 7B)
    Returns: Chinchilla-optimal token count and FLOPs estimate
    """
    tokens_optimal = 20 * params                   # Chinchilla rule of thumb
    flops_estimate = 6 * params * tokens_optimal   # ~6 * N * D for transformer training
    return {
        "params": params,
        "params_B": params / 1e9,
        "tokens_optimal": tokens_optimal,
        "tokens_B": tokens_optimal / 1e9,
        "flops_estimate": flops_estimate,
        "flops_e21": flops_estimate / 1e21,
    }

# Compare common model sizes
for model_name, params in [
    ("LLaMA 3.2 1B", 1e9),
    ("LLaMA 3.2 8B", 8e9),
    ("LLaMA 3 70B", 70e9),
    ("GPT-3 175B", 175e9),
]:
    r = chinchilla_optimal(params)
    print(
        f"{model_name:<18} | {r['params_B']:>6.1f}B params | "
        f"optimal: {r['tokens_B']:>6.0f}B tokens | "
        f"{r['flops_e21']:.1f}e21 FLOPs"
    )
# Note: Modern models (LLaMA 3, Gemma 3) over-train by 10-100x for better
# inference efficiency; Chinchilla is the floor, not the ceiling.
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | Transformers, Deep Learning Fundamentals |
| Leads to | LLMs Overview, Fine-Tuning (SFT stage) |
| Compare with | Fine-tuning (adaptation), Few-shot (no training) |
| Cross-domain | Distributed systems, HPC, Data engineering |
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Compute budget misallocation | Over-parameterized model with insufficient data (or vice versa) | Ignoring Chinchilla scaling laws (20 tokens per parameter) | Use Chinchilla-optimal ratios for compute allocation |
| Data quality plateau | Loss stops decreasing despite more compute | Training data contains duplicates, noise, low-quality content | Deduplicate, filter by perplexity, quality-score data |
| Emergent capability surprises | Capabilities appear or disappear at unexpected scale | Phase transitions in model behavior | Benchmark at multiple scales, don't extrapolate from small models |
◆ Hands-On Exercises¶
Exercise 1: Plot Your Own Scaling Law¶
Goal: Train models at 3 different scales and verify power-law behavior
Time: 45 minutes
Steps:
1. Train a small transformer at 1M, 10M, and 100M parameters on the same dataset
2. Log validation loss at each scale
3. Plot loss vs compute on a log-log scale
4. Fit a power law and extrapolate (see the sketch below)
Expected Output: Log-log plot where the scaling law appears as a straight line
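A sketch of steps 3-4, assuming NumPy is available; the compute and loss values below are placeholders to replace with your own logged measurements:

# ⚠️ Placeholder data; substitute your own (compute, loss) measurements
import numpy as np

compute = np.array([1e15, 1e16, 1e17])  # training FLOPs per run (example values)
loss = np.array([4.2, 3.6, 3.1])        # logged validation losses (placeholders)

# L = a * C^(-alpha)  <=>  log L = log a - alpha * log C  (linear on log-log axes)
slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), deg=1)
print(f"fitted exponent alpha ≈ {-slope:.3f}")

# Extrapolate one decade beyond the largest run
predicted = 10 ** (intercept + slope * np.log10(1e18))
print(f"predicted loss at 1e18 FLOPs ≈ {predicted:.2f}")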
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 📄 Paper | Kaplan et al. "Scaling Laws for Neural Language Models" (2020) | Original OpenAI scaling laws — compute, data, parameters |
| 📄 Paper | Hoffmann et al. "Chinchilla" (2022) | Revised scaling: compute-optimal training needs more data than expected |
| 🎥 Video | Andrej Karpathy — "Let's Build GPT" | Build a language model from scratch — pretraining intuition |
| 📘 Book | "AI Engineering" by Chip Huyen (2025), Ch 2 | Practical understanding of model selection and scaling tradeoffs |
★ Sources¶
- Kaplan et al., "Scaling Laws for Neural Language Models" (2020)
- Hoffmann et al., "Training Compute-Optimal Large Language Models" (Chinchilla, 2022)
- Touvron et al., "LLaMA: Open and Efficient Foundation Language Models" (2023)
- Meta, "LLaMA 3 Technical Report" (2024)
- NVIDIA, "Megatron-LM: Training Multi-Billion Parameter Language Models" (2020)