Probability & Statistics for AI¶
✨ Bit: LLMs don't "think." They compute probability distributions over tokens and sample from them. "The capital of France is ___" → P("Paris") = 0.95, P("Lyon") = 0.03. That's literally all it does.
★ TL;DR¶
- What: The probability and statistics concepts that underpin how GenAI models learn, generate, and are evaluated
- Why: LLMs are probability machines. Understanding distributions, sampling, and loss functions = understanding how models generate text.
- Key point: Temperature, top-k, top-p? Those are sampling strategies from probability theory. Cross-entropy loss? That's information theory. You use this daily in GenAI.
★ Overview¶
Definition¶
This document covers the specific probability and statistics concepts needed for GenAI — focused on what matters for understanding model training, text generation, and evaluation.
Scope¶
GenAI-relevant probability only. Not a full statistics course. Covers: distributions, Bayes, loss functions, and sampling strategies.
Prerequisites¶
- Basic math (addition, multiplication, exponents)
★ Deep Dive¶
Probability Basics for GenAI¶
FUNDAMENTAL IDEA:
P(next token = "Paris" | context = "The capital of France is")
This is what EVERY language model computes:
"Given what came before, what's the probability of each possible next word?"
The model outputs a probability DISTRIBUTION over the entire vocabulary:
| Token | Probability |
|---|---|
| "Paris" | 0.92 |
| "Lyon" | 0.03 |
| "the" | 0.01 |
| "Berlin" | 0.001 |
| ... | ... |

(50,000+ tokens; all probabilities sum to 1.0)
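To make this concrete, here's a minimal sketch (toy vocabulary, made-up logits — not any real model) of turning raw scores into a categorical distribution with softmax and sampling one next token from it:

```python
import numpy as np

# Toy vocabulary and made-up logits (the raw scores a model would output)
vocab = ["Paris", "Lyon", "the", "Berlin"]
logits = np.array([6.0, 2.6, 1.5, -0.3])

# Softmax turns logits into probabilities that sum to 1.0
probs = np.exp(logits - logits.max())
probs /= probs.sum()

for token, p in zip(vocab, probs):
    print(f"{token:>8}: {p:.3f}")

# Generation = sampling from this categorical distribution
next_token = np.random.choice(vocab, p=probs)
print("sampled:", next_token)
```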
Key Distributions¶
| Distribution | Shape | Where in GenAI |
|---|---|---|
| Uniform | All outcomes equally likely | Random initialization of weights |
| Normal/Gaussian | Bell curve (μ = mean, σ = std) | Weight initialization, diffusion (noise is Gaussian!), embeddings |
| Categorical | Probability over discrete options | LLM output: probability over vocab tokens |
| Bernoulli | Binary (yes/no) | Dropout (randomly disable neurons) |
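For intuition, here's how you'd draw from each of these with NumPy (a quick sketch; sizes and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

uniform_draw  = rng.uniform(-0.1, 0.1, size=5)             # uniform: e.g. naive weight init
gaussian_draw = rng.normal(loc=0.0, scale=0.02, size=5)    # normal: weight init, diffusion noise
token_id      = rng.choice(4, p=[0.92, 0.03, 0.01, 0.04])  # categorical: pick a token id
dropout_mask  = rng.binomial(n=1, p=0.9, size=5)           # Bernoulli: keep each neuron with p=0.9

print(uniform_draw, gaussian_draw, token_id, dropout_mask, sep="\n")
```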
Normal (Gaussian) Distribution:
┌────────────────┐
│ ████████ │
│ ████████████ │ μ = mean (center)
│████████████████│ σ = std deviation (spread)
│████████████████│
─────┼────────────────┼─────
-3σ -2σ -σ μ σ 2σ 3σ
68% of data within 1σ of mean
95% within 2σ
99.7% within 3σ
GenAI Use: Diffusion models ADD Gaussian noise to images during training
and learn to REMOVE it during generation.
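A minimal sketch of that forward (noising) step in NumPy — illustrative only; real diffusion models use a fixed noise schedule over many steps:

```python
import numpy as np

rng = np.random.default_rng(0)

clean_image = rng.uniform(0.0, 1.0, size=(8, 8))      # stand-in for a real image
noise = rng.normal(loc=0.0, scale=1.0, size=(8, 8))   # Gaussian noise: μ = 0, σ = 1

# One forward diffusion step: blend the image with Gaussian noise
alpha = 0.7   # how much of the original signal survives this step
noisy_image = np.sqrt(alpha) * clean_image + np.sqrt(1 - alpha) * noise

# Training teaches a network to predict `noise` given `noisy_image`,
# so at generation time it can remove the noise step by step.
print(noisy_image.mean(), noisy_image.std())
```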
Bayes' Theorem¶
P(A|B) = P(B|A) × P(A) / P(B)
In plain English:
"Probability of A given B" =
"How likely B is if A is true" × "How likely A is" ÷ "How likely B is overall"
GenAI connection:
The whole language model can be viewed as:
P(next token | all previous tokens) — conditional probability
Bayesian updating is conceptually how we think about
adding new information (RAG context) to change model predictions.
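A quick worked example in code (made-up spam-filter numbers, purely to show the arithmetic):

```python
# Hypothetical numbers, purely illustrative
p_spam = 0.20                 # P(A): prior — 20% of email is spam
p_word_given_spam = 0.60      # P(B|A): "free" appears in 60% of spam
p_word_given_ham = 0.05       # "free" appears in 5% of non-spam

# P(B): overall probability the word appears
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes: P(A|B) = P(B|A) × P(A) / P(B)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(f"P(spam | 'free') = {p_spam_given_word:.2f}")   # 0.75 — the evidence lifts belief from 20% to 75%
```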
Loss Functions (How Models Learn)¶
Loss = "How wrong is the model?" (lower = better)
TRAINING GOAL: Minimize the loss function by adjusting weights.
| Loss Function | Formula Intuition | Used For |
|---|---|---|
| Cross-Entropy | -Σ y·log(ŷ) | LLM pre-training (next-token prediction), classification |
| MSE | Σ(y - ŷ)² / n | Diffusion (predict noise), regression |
| KL Divergence | Σ p·log(p/q) — how far distribution p is from q | VAEs, RLHF (keep model close to original) |
| Binary Cross-Entropy | Cross-entropy for yes/no | Binary classification, DPO |
CROSS-ENTROPY EXAMPLE (LLM training):
True next token: "Paris" (one-hot: [0, 0, 1, 0, ...])
Model prediction: [0.05, 0.02, 0.85, 0.08, ...]
Loss = -log(0.85) = 0.16 ← Small loss! Model is mostly right.
If model predicted P("Paris") = 0.01:
Loss = -log(0.01) = 4.6 ← Large loss! Model is very wrong.
Training pushes: "Increase P(correct token), decrease P(wrong tokens)"
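The same numbers in a few lines of Python (a sketch using natural log, which is what most frameworks use):

```python
import math

def cross_entropy(true_index, predicted_probs):
    # With a one-hot target, the loss reduces to -log(P(correct token))
    return -math.log(predicted_probs[true_index])

probs_good = [0.05, 0.02, 0.85, 0.08]   # model puts 0.85 on "Paris" (index 2)
probs_bad  = [0.50, 0.30, 0.01, 0.19]   # model puts only 0.01 on "Paris"

print(cross_entropy(2, probs_good))   # ≈ 0.16 — small loss, mostly right
print(cross_entropy(2, probs_bad))    # ≈ 4.61 — large loss, very wrong
```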
Sampling Strategies (How LLMs Generate Text)¶
After computing P(next token), how do we PICK the actual token?
GREEDY: Always pick the highest probability token.
P: [Paris=0.92, Lyon=0.03, ...] → Always outputs "Paris"
✅ Deterministic, consistent
❌ Boring, repetitive, no creativity
TEMPERATURE SAMPLING:
Adjust probabilities before sampling:
P_adjusted = softmax(logits / temperature)
temperature = 0.0: → Greedy (treated as argmax — you can't literally divide by 0)
temperature = 0.3: → Mostly top tokens, slight variety
temperature = 0.7: → Balanced creativity
temperature = 1.0: → Original distribution
temperature = 2.0: → Very random, less coherent
HOW IT WORKS:
Low temp → Sharpens distribution (top token dominates)
High temp → Flattens distribution (all tokens more equal)
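A minimal sketch of that effect (made-up logits, not from a real model):

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    scaled = np.asarray(logits, dtype=float) / temperature
    exps = np.exp(scaled - scaled.max())   # subtract max for numerical stability
    return exps / exps.sum()

logits = [6.0, 2.6, 1.5, -0.3]   # "Paris", "Lyon", "the", "Berlin"

print(softmax_with_temperature(logits, 0.3))   # sharp: top token dominates
print(softmax_with_temperature(logits, 1.0))   # the original distribution
print(softmax_with_temperature(logits, 2.0))   # flat: more randomness, less coherence
```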
| Strategy | How It Works | When to Use |
|---|---|---|
| Greedy | Pick highest P every time | Factual/deterministic tasks |
| Temperature | Scale logits before softmax | General creativity control |
| Top-K | Only consider top K tokens | Prevent very rare token selection |
| Top-P (Nucleus) | Keep the smallest set of tokens whose cumulative probability ≥ p | Adaptive — good default |
| Top-K + Top-P | Apply both filters | Production default for most APIs |
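To make the table concrete, here's a minimal sketch of top-k followed by top-p (nucleus) filtering over a probability vector — illustrative only, not how any particular API implements it internally:

```python
import numpy as np

def filter_top_k_top_p(probs, k=50, p=0.9):
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]   # token indices, most likely first

    keep = order[:k]                  # Top-K: keep the K most likely tokens

    # Top-P: within those, keep the smallest prefix whose cumulative probability ≥ p
    cumulative = np.cumsum(probs[keep])
    keep = keep[: np.searchsorted(cumulative, p) + 1]

    # Zero out everything else and renormalize, then sample as usual
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

print(filter_top_k_top_p([0.50, 0.30, 0.15, 0.04, 0.01], k=3, p=0.9))
# → roughly [0.526, 0.316, 0.158, 0, 0]
```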
```python
# ⚠️ Last tested: 2026-04
# OpenAI API example — these ARE sampling strategies
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a poem"}],
    temperature=0.7,  # creativity level
    top_p=0.9,        # nucleus sampling (consider top 90% probability mass)
    max_tokens=200,
)
```
◆ Quick Reference¶
SAMPLING PARAMETERS:
temperature = 0 → Deterministic (data extraction, coding)
temperature = 0.3 → Low creativity (summarization, Q&A)
temperature = 0.7 → Balanced (general chat, writing)
temperature = 1.0 → Full creativity (brainstorming)
top_p = 0.9 → Good default (drop the low-probability tail beyond 90% cumulative mass)
top_k = 50 → Only consider top 50 tokens
LOSS FUNCTIONS:
LLM training → Cross-entropy
Diffusion → MSE (predict noise)
Classification → Cross-entropy
Regression → MSE
RLHF → KL divergence + reward
KEY DISTRIBUTIONS:
Gaussian noise → Diffusion models
Categorical → Token generation
Uniform → Weight initialization
○ Gotchas & Common Mistakes¶
- ⚠️ Temperature 0 ≠ deterministic in all APIs: Some implementations still have slight randomness at temperature 0. Use the seed parameter for true determinism.
- ⚠️ Cross-entropy loss can be misleading: Low loss ≠ good model. A model with low loss might still hallucinate or be unsafe.
- ⚠️ Top-P and Top-K stack: They're applied together, not alternatives. Top-K filters first, then Top-P within the remaining.
- ⚠️ "Probability" in LLMs isn't belief: The model's P("Paris") = 0.92 doesn't mean it "believes" Paris is the answer. It means the pattern "France is [Paris]" is statistically dominant in training data.
○ Interview Angles¶
- Q: What is temperature in LLM generation?
- A: Temperature scales the logits before softmax. Low temperature (→ 0) makes the distribution sharper (confident picks); high temperature makes it flatter (more random picks). Mathematically: P = softmax(logits / T).
- Q: What loss function do LLMs use and why?
- A: Cross-entropy loss. It measures how different the model's predicted probability distribution is from the true distribution (where the correct next token has probability 1). Minimizing cross-entropy pushes the model to assign high probability to the correct token.
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | Basic math |
| Leads to | Neural Networks, Deep Learning Fundamentals, LLMs Overview |
| Compare with | Deterministic programming (no randomness), Rule-based systems |
| Cross-domain | Information theory (entropy), Bayesian statistics, Signal processing |
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 📘 Book | "All of Statistics" by Wasserman (2004) | Concise treatment of statistics for ML |
| 🎥 Video | StatQuest with Josh Starmer | Best visual explanations of statistical concepts |
| 🎓 Course | MIT 6.041: Intro to Probability | Rigorous probability foundations |
★ Sources¶
- Goodfellow, Bengio & Courville, "Deep Learning", Chapter 3 (Probability and Information Theory)
- OpenAI API Parameter Guide — https://platform.openai.com/docs/api-reference
- StatQuest "Probability" series — https://youtube.com/statquest