Large Language Models (LLMs)¶
✨ Bit: LLMs are stochastic parrots that accidentally learned to reason. Or did they? The debate continues.
★ TL;DR¶
- What: Neural networks (Transformer-based) trained on massive text corpora to understand and generate human language
- Why: Foundation of modern AI assistants, code generation, search, and almost every GenAI product
- Key point: "Scaling laws" showed that performance predictably improves with more data + compute + parameters. This triggered the arms race.
★ Overview¶
Definition¶
Large Language Models (LLMs) are autoregressive Transformer models (decoder-only) with billions to trillions of parameters, trained on internet-scale text data to predict the next token. Through scale, they develop emergent capabilities: reasoning, coding, translation, analysis — tasks they were never explicitly taught.
Scope¶
This document covers LLMs as a category. For specific model families, see sub-documents. For the underlying architecture, see Transformers. For reliability risks and grounding strategy, see Hallucination Detection & Mitigation.
Significance¶
- The core technology behind ChatGPT, Claude, Gemini, Copilot
- LLM market: $7.77B (2025) → projected $10.57B (2026)
- Anthropic surpassed OpenAI in enterprise usage in 2025
Last verified for market and provider-snapshot statements: 2026-04.
Prerequisites¶
- Transformers — architecture
- Attention Mechanism — how attention works
★ Deep Dive¶
How LLMs Are Built¶
Phase 1: PRE-TRAINING (the expensive part)
┌─────────────────────────────────────────────────┐
│ Internet text (trillions of tokens) │
│ ↓ │
│ Train: Predict next token │
│ "The cat sat on the ___" → "mat" │
│ ↓ │
│ Result: Base model (knows language, world facts) │
│ Cost: $10M - $100M+ │
└─────────────────────────────────────────────────┘
Phase 2: ALIGNMENT (making it helpful & safe)
┌─────────────────────────────────────────────────┐
│ SFT: Supervised Fine-Tuning on instruction data │
│ Input: "Explain quantum computing" │
│ Output: [high-quality human-written answer] │
│ ↓ │
│ RLHF/DPO: Learn from human preferences │
│ "Which response is better: A or B?" │
│ ↓ │
│ Result: Chat model (helpful, harmless, honest) │
└─────────────────────────────────────────────────┘
Phase 3: DEPLOYMENT
┌─────────────────────────────────────────────────┐
│ API / Chat interface │
│ + RAG for up-to-date knowledge │
│ + Tool use for actions (search, code execution) │
│ + Guardrails for safety │
└─────────────────────────────────────────────────┘
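To make Phase 1 concrete, here is a minimal sketch of the pre-training objective, using GPT-2 via HuggingFace purely as a small stand-in (any causal LM works the same way). HuggingFace shifts the labels internally, so position t is trained to predict token t+1:
# pip install transformers torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # small stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")
batch = tok("The cat sat on the mat", return_tensors="pt")
# labels=input_ids yields the next-token cross-entropy loss;
# pre-training is (conceptually) minimizing this over trillions of tokens
out = model(**batch, labels=batch["input_ids"])
print(float(out.loss))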
The Major Model Families (March 2026)¶
Closed-Source¶
| Model | Company | Latest | Key Strengths | Context |
|---|---|---|---|---|
| GPT | OpenAI | GPT-5.4 Pro | Unified reasoning + multimodal, reduced hallucinations | Large |
| Claude | Anthropic | Opus 4.6, Sonnet 4.6 | Best coding + agents, extended thinking | 200K |
| Gemini | Google | 3.1 Pro, 3 Deep Think | Massive context (2M tokens), science/research | 2M |
Open-Weight¶
| Model | Company | Latest | Key Strengths | Architecture |
|---|---|---|---|---|
| LLaMA | Meta | LLaMA 4 (Scout/Maverick/Behemoth) | First multimodal LLaMA, MoE | MoE, 10M context (Scout) |
| Qwen | Alibaba | Qwen 2.5+ | Surpassed LLaMA in open-source popularity | Dense & MoE |
| Mistral | Mistral AI | Mistral Large 2 | Strong European alternative | Dense & MoE |
| DeepSeek | DeepSeek | DeepSeek-V3, R1 | Competitive at fraction of cost | MoE |
| Gemma | Google | Gemma 2 | Small but powerful (2B-27B) | Dense |
Scaling Laws (Chinchilla)¶
The relationship between model size, data, and performance:
Loss ∝ Compute^(-α)  (test loss falls as a power law as compute grows)
Where Compute ≈ 6 × Parameters × Training Tokens
Chinchilla optimal: Train for ~20 tokens per parameter
- 7B model → 140B tokens
- 70B model → 1.4T tokens
Modern trend: Over-train smaller models (more tokens per param)
for better inference efficiency
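A quick back-of-envelope sketch of the Chinchilla arithmetic, using the standard Compute ≈ 6·N·D approximation:
# Chinchilla back-of-envelope: optimal tokens D ≈ 20 × parameters N
def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    return n_params * tokens_per_param

for n in (7e9, 70e9):
    d = chinchilla_tokens(n)
    print(f"{n/1e9:.0f}B params -> {d/1e12:.2f}T tokens, compute ~ {6*n*d:.1e} FLOPs")
# 7B -> 0.14T tokens (140B); 70B -> 1.40T tokens, matching the figures above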
Key Concepts Every Deep-Tech Person Must Know¶
Tokenization¶
Text → numbers. Models don't see words; they see token IDs.
# ⚠️ Last tested: 2026-04
"Hello world" → [15496, 995] # GPT-style BPE
"Hello world" → [8774, 296, 1650] # Different tokenizer
# ~4 characters ≈ 1 token (English average)
# Non-English: often 2-3x more tokens per word
Tokenizers: BPE (GPT), WordPiece (BERT), SentencePiece (LLaMA/Gemini)
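A quick way to see this in practice with tiktoken (exact counts vary by tokenizer; the non-English string is just an illustrative example):
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-style BPE
for text in ["Hello world", "The quick brown fox", "Übersetzungsqualität verbessern"]:
    n = len(enc.encode(text))
    print(f"{text!r}: {n} tokens, {len(text)/n:.1f} chars/token")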
Inference: How Generation Works¶
Input: "The capital of France is"
Step 1: Process all input tokens (prefill)
Step 2: Generate token "Paris" → append to sequence
Step 3: Generate token "." → append
Step 4: Generate token "<EOS>" → stop
Each step: Full forward pass through the model
KV Cache: Store key/value pairs to avoid recomputation
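The loop below sketches this prefill-then-decode pattern with an explicit KV cache, again using GPT-2 as a small stand-in and greedy decoding for simplicity:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
ids = tok("The capital of France is", return_tensors="pt").input_ids

past = None
with torch.no_grad():
    for _ in range(5):
        # Prefill processes the whole prompt; afterwards we feed only the newest token
        inp = ids if past is None else ids[:, -1:]
        out = model(input_ids=inp, past_key_values=past, use_cache=True)
        past = out.past_key_values  # KV cache: skip recomputing attention over the prefix
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        ids = torch.cat([ids, next_id], dim=-1)
print(tok.decode(ids[0]))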
Temperature & Sampling¶
Temperature = 0.0: Always pick highest probability (deterministic)
Temperature = 0.7: Balanced creativity vs coherence (common default)
Temperature = 1.0: Full probability distribution
Temperature > 1.0: More random/creative
Top-p (nucleus sampling): Sample only from the smallest set of tokens whose
cumulative probability reaches p (e.g., top_p=0.9)
Top-k: Only sample from the k most likely tokens
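A minimal NumPy sketch of how these knobs combine; this is simplified relative to production samplers, which also handle batching, repetition penalties, and edge cases:
import numpy as np

def sample(logits, temperature=0.7, top_k=50, top_p=0.9, rng=None):
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)  # temp ~0 -> near-greedy
    k = min(top_k, len(logits))
    logits[logits < np.sort(logits)[-k]] = -np.inf   # top-k: drop all below the k-th logit
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                  # tokens by descending probability
    cum = np.cumsum(probs[order])
    tail = order[cum > top_p][1:]                    # keep the token that first crosses p, drop the rest
    probs[tail] = 0.0
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

print(sample(np.array([2.0, 1.0, 0.5, -1.0])))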
◆ Comparison¶
| Aspect | GPT-5.x | Claude 4.x | Gemini 3.x | LLaMA 4 |
|---|---|---|---|---|
| Best at | General reasoning | Coding + agents | Long context + science | Open-weight flexibility |
| Context | Large | 200K | Up to 2M | 10M (Scout) |
| Access | API only | API only | API + Cloud | Downloadable weights |
| Cost | $$$ | $$ | $$ | Free (compute costs) |
| Fine-tuning | Limited | Limited | Via Vertex AI | Full control |
| Architecture | Dense (rumored MoE) | Dense | Dense variants | MoE |
◆ Use Cases & Applications¶
| Use Case | How LLMs Are Used | Key Challenge |
|---|---|---|
| Chat assistants | Direct conversation (ChatGPT, Claude) | Hallucination |
| Code generation | Copilot, Cursor, Devin | Correctness verification |
| Search | Perplexity, Google AI Overviews | Up-to-date knowledge |
| Document processing | Summarization, extraction, Q&A | Long document handling |
| Translation | Near-human quality across languages | Nuance, cultural context |
| Agents | Autonomous task execution with tools | Reliability, safety |
○ Gotchas & Common Mistakes¶
- ⚠️ Hallucination ≠ lying: The model generates plausible continuations, not facts. It has no concept of truth.
- ⚠️ Context window ≠ memory: LLMs don't remember across conversations unless you build memory systems
- ⚠️ Bigger ≠ always better: A well-fine-tuned 7B model can beat a generic 70B model on specific tasks
- ⚠️ Tokens ≠ words: Pricing and limits are in tokens (~4 chars each). Non-English = more tokens
- ⚠️ Benchmarks lie: Models are increasingly trained on benchmark data. Real-world eval matters more
○ Interview Angles¶
- Q: Explain the training pipeline of a modern LLM.
  A: Pre-training (next-token prediction on internet text) → SFT (supervised fine-tuning on instruction-response pairs) → RLHF/DPO (learning from human preference comparisons) → safety alignment.
- Q: What's the difference between dense and MoE architectures?
  A: Dense: every parameter processes every token (e.g., GPT-4, Claude). MoE: tokens are routed to a subset of "expert" sub-networks (e.g., LLaMA 4 Maverick). MoE gives more total capacity with less compute per token.
- Q: How do you choose between using an API (GPT/Claude) vs hosting an open model (LLaMA)?
  A: API: faster to start, best performance, no infra. Self-host: data stays private, no vendor lock-in, customizable. Cost crossover: at roughly 1M+ tokens/day, self-hosting often becomes cheaper (see the break-even sketch below).
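A toy break-even calculation for that last answer. All prices here are placeholders for illustration, not real vendor or cloud rates; plug in your own numbers:
# Hypothetical prices only: swap in your actual API and GPU costs
API_USD_PER_1M_TOKENS = 25.0   # assumed frontier-model blended price
GPU_USD_PER_HOUR = 1.0         # assumed always-on inference GPU

def daily_costs(tokens_per_day: float) -> tuple[float, float]:
    api = tokens_per_day / 1e6 * API_USD_PER_1M_TOKENS
    self_host = 24 * GPU_USD_PER_HOUR  # fixed cost of keeping one GPU up
    return api, self_host

for tpd in (1e5, 1e6, 1e7):
    api, sh = daily_costs(tpd)
    print(f"{tpd:10.0f} tok/day: API ${api:7.2f} vs self-host ${sh:6.2f}")
# With these placeholder prices the crossover lands near ~1M tokens/day; real
# crossovers depend heavily on model choice, utilization, and ops overhead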
★ Code & Implementation¶
Call OpenAI GPT API (Streaming)¶
# pip install openai>=1.60
# ⚠️ Last tested: 2026-04 | Requires: openai>=1.60, OPENAI_API_KEY env var
from openai import OpenAI
client = OpenAI()
# Token counting pre-flight (avoid hitting context limits)
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o-mini")  # match the model called below
user_message = "Explain transformers in 3 sentences."
token_count = len(enc.encode(user_message))
print(f"Prompt tokens: {token_count}") # Budget-check before expensive call
# Streaming generation
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # ~30x cheaper than gpt-4o; quality close on routine tasks
messages=[
{"role": "system", "content": "You are a concise ML expert."},
{"role": "user", "content": user_message},
],
max_tokens=200,
temperature=0.7,
stream=True,
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)
print() # newline after streaming
Load Open-Weight LLM with HuggingFace¶
# pip install transformers>=4.40 torch>=2.3
# ⚠️ Last tested: 2026-04 | Requires: transformers>=4.40, torch>=2.3
# Note: CPU-only is slow. GPU (CUDA or MPS) strongly recommended.
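# Note: Gemma weights are gated on HuggingFace; accept the license and run
# `huggingface-cli login` before the first download.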
from transformers import pipeline
# Gemma 2 2B (instruction-tuned): ~2B params, ~5GB, runs on most GPUs
pipe = pipeline(
"text-generation",
model="google/gemma-2-2b-it",
device_map="auto", # auto-detect GPU/CPU
max_new_tokens=200,
do_sample=True,
temperature=0.7,
)
response = pipe([{"role": "user", "content": "What is an LLM?"}])
print(response[0]["generated_text"][-1]["content"])
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | Transformers, Attention Mechanism |
| Leads to | RAG, Fine-Tuning, AI Agents, Prompt Engineering |
| Compare with | Traditional NLP (rule-based), Smaller language models (BERT-era) |
| Cross-domain | Cognitive science (language understanding), Linguistics |
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Hallucination in critical paths | Confident but factually wrong outputs in production | No grounding, no verification layer | RAG, citation requirements, confidence calibration |
| Prompt sensitivity | Small wording changes cause dramatically different outputs | Models are sensitive to prompt phrasing | Prompt testing suite, A/B test prompts, few-shot examples |
| Context window mismanagement | Truncated context or token limit errors | No token counting before API call | Pre-flight token counting, dynamic prompt assembly (see sketch below) |
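For the context-window row, a minimal pre-flight budgeting sketch. tiktoken is assumed; `fit_context` and the 8,000-token budget are illustrative choices, not a standard API:
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def fit_context(system: str, chunks: list[str], question: str, budget: int = 8000) -> str:
    """Greedily pack retrieved chunks until the token budget would overflow."""
    used = len(enc.encode(system)) + len(enc.encode(question))
    kept = []
    for chunk in chunks:   # assume chunks are already sorted by relevance
        n = len(enc.encode(chunk))
        if used + n > budget:
            break          # drop lower-priority chunks instead of hitting the limit
        kept.append(chunk)
        used += n
    return "\n\n".join([system, *kept, question])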
◆ Hands-On Exercises¶
Exercise 1: Stress-Test an LLM's Boundaries¶
Goal: Systematically discover where an LLM fails
Time: 30 minutes
Steps:
1. Create 20 test cases: factual, reasoning, math, code, multilingual
2. Run them through your production LLM
3. Grade each response (correct, partially correct, hallucinated, refused)
4. Create a failure pattern taxonomy
Expected Output: Failure taxonomy with frequency counts per category
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 📘 Book | "AI Engineering" by Chip Huyen (2025), Ch 1-2 | Best introduction to LLMs for practitioners |
| 🎥 Video | Andrej Karpathy — "Intro to Large Language Models" | Best 1-hour overview of how LLMs work |
| 📘 Book | "Build a Large Language Model (From Scratch)" by Sebastian Raschka (2024) | Implement an LLM from scratch in PyTorch |
| 🔧 Hands-on | HuggingFace Transformers | Production library for working with LLMs |
★ Sources¶
- OpenAI GPT-5 release blog and model cards (2025-2026)
- Anthropic Claude 4 model documentation (2025-2026)
- Google Gemini release notes (2025-2026)
- Meta LLaMA 4 announcement (April 2025)
- Hoffmann et al., "Training Compute-Optimal Large Language Models" (Chinchilla, 2022)
- Sebastian Raschka, "LLM Year in Review 2025"