
Probability & Statistics for AI

Bit: LLMs don't "think." They compute probability distributions over tokens and sample from them. "The capital of France is ___" → P("Paris") = 0.95, P("Lyon") = 0.03. That's literally all it does.


★ TL;DR

  • What: The probability and statistics concepts that underpin how GenAI models learn, generate, and are evaluated
  • Why: LLMs are probability machines. Understanding distributions, sampling, and loss functions = understanding how models generate text.
  • Key point: Temperature, top-k, top-p? Those are sampling strategies from probability theory. Cross-entropy loss? That's information theory. You use this daily in GenAI.

★ Overview

Definition

This document covers the specific probability and statistics concepts needed for GenAI — focused on what matters for understanding model training, text generation, and evaluation.

Scope

GenAI-relevant probability only. Not a full statistics course. Covers: distributions, Bayes, loss functions, and sampling strategies.

Prerequisites

  • Basic math (addition, multiplication, exponents)

★ Deep Dive

Probability Basics for GenAI

FUNDAMENTAL IDEA:
  P(next token = "Paris" | context = "The capital of France is")

  This is what EVERY language model computes:
  "Given what came before, what's the probability of each possible next word?"

  The model outputs a probability DISTRIBUTION over the entire vocabulary:

  Token         Probability
  "Paris"       0.92
  "Lyon"        0.03
  "the"         0.01
  "Berlin"      0.001
  ...           ...
  (50,000+ tokens, all probabilities sum to 1.0)
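
A minimal sketch of that idea in Python (a toy five-token vocabulary with made-up logits, purely illustrative):

import numpy as np

# Hypothetical raw scores (logits) for each candidate token
vocab  = ["Paris", "Lyon", "the", "Berlin", "London"]
logits = np.array([9.2, 5.8, 4.7, 3.1, 2.9])

# Softmax turns arbitrary real-valued scores into a probability distribution
probs = np.exp(logits - logits.max())   # subtract max for numerical stability
probs = probs / probs.sum()

for token, p in zip(vocab, probs):
    print(f"{token:>8}  {p:.3f}")
print("sum =", round(probs.sum(), 3))   # always 1.0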

Key Distributions

Distribution      Shape                               Where in GenAI
Uniform           All outcomes equally likely         Random initialization of weights
Normal/Gaussian   Bell curve (μ = mean, σ = std)      Weight initialization, diffusion (noise is Gaussian!), embeddings
Categorical       Probability over discrete options   LLM output: probability over vocab tokens
Bernoulli         Binary (yes/no)                     Dropout (randomly disable neurons)

Normal (Gaussian) Distribution:

       ┌────────────────┐
       │    ████████    │
       │  ████████████  │         μ = mean (center)
       │████████████████│         σ = std deviation (spread)
       │████████████████│
  ─────┼────────────────┼─────
      -3σ  -2σ  -σ   μ   σ   2σ  3σ

  68% of data within 1σ of mean
  95% within 2σ
  99.7% within 3σ

GenAI Use: Diffusion models ADD Gaussian noise to images during training
           and learn to REMOVE it during generation.
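
The same idea in a few lines (a toy forward-noising step, not a real diffusion schedule; the 0.3 noise level is made up):

import numpy as np

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(8, 8))                  # toy "image"

# One simplified forward-diffusion step: blend the image with Gaussian noise
noise_level = 0.3                                           # illustrative only
noise = rng.normal(loc=0.0, scale=1.0, size=image.shape)    # μ = 0, σ = 1
noisy = np.sqrt(1 - noise_level) * image + np.sqrt(noise_level) * noise

# Training teaches a network to predict `noise` given `noisy`;
# generation runs that prediction in reverse to denoise step by step.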

Bayes' Theorem

P(A|B) = P(B|A) × P(A) / P(B)

In plain English:
  "Probability of A given B" =
    "How likely B is if A is true" × "How likely A is" ÷ "How likely B is overall"

GenAI connection:
  The whole language model can be viewed as:
  P(next token | all previous tokens) — conditional probability

  Bayesian updating is conceptually how we think about
  adding new information (RAG context) to change model predictions.

Loss Functions (How Models Learn)

Loss = "How wrong is the model?" (lower = better)

TRAINING GOAL: Minimize the loss function by adjusting weights.

Loss Function          Formula / Intuition                    Used For
Cross-Entropy          -Σ y·log(ŷ)                            LLM pre-training (next-token prediction), classification
MSE                    Σ(y - ŷ)² / n                          Diffusion (predict noise), regression
KL Divergence          How different two distributions are    VAEs, RLHF (keep model close to original)
Binary Cross-Entropy   Cross-entropy for a yes/no outcome     Binary classification, DPO

CROSS-ENTROPY EXAMPLE (LLM training):

  True next token: "Paris" (one-hot: [0, 0, 1, 0, ...])
  Model prediction: [0.05, 0.02, 0.85, 0.08, ...]

  Loss = -log(0.85) = 0.16  ← Small loss! Model is mostly right.

  If model predicted P("Paris") = 0.01:
  Loss = -log(0.01) = 4.6   ← Large loss! Model is very wrong.

  Training pushes: "Increase P(correct token), decrease P(wrong tokens)"
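
The same arithmetic in Python (toy four-token vocabulary, numbers taken from the example above):

import numpy as np

pred = np.array([0.05, 0.02, 0.85, 0.08])   # model's predicted distribution
true = np.array([0.0,  0.0,  1.0,  0.0])    # one-hot: "Paris" is the 3rd token

loss = -np.sum(true * np.log(pred))         # only the correct token's term survives
print(round(loss, 2))                       # 0.16

print(round(-np.log(0.01), 2))              # 4.61: the loss if P("Paris") were 0.01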

Sampling Strategies (How LLMs Generate Text)

After computing P(next token), how do we PICK the actual token?

GREEDY: Always pick the highest probability token.
  P: [Paris=0.92, Lyon=0.03, ...] → Always outputs "Paris"
  ✅ Deterministic, consistent
  ❌ Boring, repetitive, no creativity

TEMPERATURE SAMPLING:
  Adjust probabilities before sampling:
  P_adjusted = softmax(logits / temperature)

  temperature = 0.0: → Greedy (always pick top)
  temperature = 0.3: → Mostly top tokens, slight variety
  temperature = 0.7: → Balanced creativity
  temperature = 1.0: → Original distribution
  temperature = 2.0: → Very random, less coherent

  HOW IT WORKS:
  Low temp → Sharpens distribution (top token dominates)
  High temp → Flattens distribution (all tokens more equal)
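
A minimal sketch of that reshaping (made-up logits for four tokens; real models apply this across the whole vocabulary):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([9.2, 5.8, 4.7, 3.1])     # toy scores for 4 candidate tokens

for temperature in (0.3, 0.7, 1.0, 2.0):
    probs = softmax(logits / temperature)
    print(temperature, np.round(probs, 3))

# Low T: the top token absorbs almost all the probability mass (sharper)
# High T: mass spreads across tokens (flatter, more random picks)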

Strategy          How It Works                                               When to Use
Greedy            Pick the highest-probability token every time              Factual/deterministic tasks
Temperature       Scale logits before softmax                                General creativity control
Top-K             Only consider the top K tokens                             Prevent very rare token selection
Top-P (Nucleus)   Smallest set of tokens whose cumulative probability ≥ P    Adaptive; a good default
Top-K + Top-P     Apply both filters                                         Production default for most APIs

# ⚠️ Last tested: 2026-04
# OpenAI API example — these ARE sampling strategies
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a poem"}],
    temperature=0.7,    # Creativity level
    top_p=0.9,          # Nucleus sampling (consider top 90% probability mass)
    max_tokens=200,
)
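
Under the hood, a decoding loop does something roughly like the sketch below (simplified; assumes Top-K is applied before Top-P, as in common implementations, and uses a made-up 5-token distribution):

import numpy as np

def sample_next_token(probs, top_k=50, top_p=0.9, rng=np.random.default_rng()):
    """Pick a token id from `probs` after Top-K, then Top-P (nucleus) filtering."""
    order = np.argsort(probs)[::-1][:top_k]          # 1) keep only the top-k token ids
    kept = probs[order]
    cumulative = np.cumsum(kept / kept.sum())
    cutoff = np.searchsorted(cumulative, top_p) + 1  # 2) smallest set covering top_p mass
    order, kept = order[:cutoff], kept[:cutoff]
    kept = kept / kept.sum()                         # renormalize what survives
    return rng.choice(order, p=kept)                 # 3) sample from the filtered distribution

probs = np.array([0.50, 0.25, 0.15, 0.07, 0.03])     # toy distribution over 5 tokens
print(sample_next_token(probs, top_k=4, top_p=0.9))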

◆ Quick Reference

SAMPLING PARAMETERS:
  temperature = 0    → Deterministic (data extraction, coding)
  temperature = 0.3  → Low creativity (summarization, Q&A)
  temperature = 0.7  → Balanced (general chat, writing)
  temperature = 1.0  → Full creativity (brainstorming)
  top_p = 0.9        → Good default (ignore bottom 10% probability)
  top_k = 50         → Only consider top 50 tokens

LOSS FUNCTIONS:
  LLM training    → Cross-entropy
  Diffusion       → MSE (predict noise)
  Classification  → Cross-entropy
  Regression      → MSE
  RLHF            → KL divergence + reward

KEY DISTRIBUTIONS:
  Gaussian noise → Diffusion models
  Categorical    → Token generation
  Uniform        → Weight initialization

○ Gotchas & Common Mistakes

  • ⚠️ Temperature 0 ≠ deterministic in all APIs: Some implementations still show slight run-to-run variation at temperature 0. A seed parameter improves reproducibility, but even that is usually best-effort rather than guaranteed.
  • ⚠️ Cross-entropy loss can be misleading: Low loss ≠ good model. A model with low loss might still hallucinate or be unsafe.
  • ⚠️ Top-P and Top-K stack: When both are set, they're applied together, not as alternatives. In most implementations Top-K filters first, then Top-P is applied within the remaining tokens.
  • ⚠️ "Probability" in LLMs isn't belief: The model's P("Paris") = 0.92 doesn't mean it "believes" Paris is the answer. It means the pattern "France is [Paris]" is statistically dominant in training data.

○ Interview Angles

  • Q: What is temperature in LLM generation?
  • A: Temperature scales the logits before softmax. Low temperature (→0) makes the distribution sharper (confident picks), high temperature makes it flatter (random picks). Mathematically: P = softmax(logits / T).

  • Q: What loss function do LLMs use and why?

  • A: Cross-entropy loss. It measures how different the model's predicted probability distribution is from the true distribution (where the correct next token has probability 1). Minimizing cross-entropy pushes the model to assign high probability to the correct token.

★ Connections

Relationship   Topics
Builds on      Basic math
Leads to       Neural Networks, Deep Learning Fundamentals, LLMs Overview
Compare with   Deterministic programming (no randomness), Rule-based systems
Cross-domain   Information theory (entropy), Bayesian statistics, Signal processing

Type        Resource                                   Why
📘 Book     "All of Statistics" by Wasserman (2004)    Concise treatment of statistics for ML
🎥 Video    StatQuest with Josh Starmer                Best visual explanations of statistical concepts
🎓 Course   MIT 6.041: Intro to Probability            Rigorous probability foundations

★ Sources

  • Ian Goodfellow, "Deep Learning" Chapter 3 (Probability and Information Theory)
  • OpenAI API Parameter Guide — https://platform.openai.com/docs/api-reference
  • StatQuest "Probability" series — https://youtube.com/statquest