Tokenization¶
✨ Bit: LLMs don't see words. They see token IDs. "Hello" =
[15496]. This is why they can't count the letters in "strawberry" — they see something like[" straw", "berry"], not individual characters.
★ TL;DR¶
- What: The process of breaking text into sub-word units (tokens) that LLMs actually process
- Why: Models can't process raw text. Tokenization determines what the model "sees" — and affects cost, multilingual performance, and model behavior
- Key point: ~4 English characters ≈ 1 token. Non-English text is often 2-3x more tokens per word (= more expensive, slower).
★ Overview¶
Definition¶
Tokenization is the process of converting raw text into a sequence of discrete units (tokens) that a language model can process. Modern tokenizers use sub-word algorithms that split text into pieces between individual characters and full words, creating a vocabulary that balances expressiveness with efficiency.
Scope¶
Covers tokenization algorithms (BPE, WordPiece, SentencePiece), practical implications, and model-specific tokenizers. For what happens after tokenization (embeddings), see Embeddings.
Significance¶
- Directly affects API cost (pricing is per-token)
- Determines multilingual capability (poor tokenizers = expensive for non-English)
- Explains many LLM "failures" (can't count letters, bad at math = tokenization artifacts)
- Different models use different tokenizers — tokens aren't transferable
Prerequisites¶
- Basic understanding of what Large Language Models (LLMs) are
★ Deep Dive¶
Why Not Just Use Words or Characters?¶
WORD-LEVEL:
Vocabulary: Every word in every language = unlimited
Problem: "unhappiness" = unknown word? Need infinite vocabulary.
CHARACTER-LEVEL:
Vocabulary: ~100 characters (a-z, A-Z, 0-9, punctuation)
Problem: "transformer" = 11 tokens. Sequences become way too long.
SUB-WORD (what we actually use):
Vocabulary: 32K - 128K tokens
"unhappiness" → ["un", "happiness"] ← Common words stay whole
"transformer" → ["transform", "er"] ← Rare words split intelligently
Balance: Compact vocabulary + handles any text
How BPE Works (Most Common Algorithm)¶
Byte Pair Encoding (BPE) — used by GPT, LLaMA, Mistral:
START: Split everything into characters
"lower" → ['l', 'o', 'w', 'e', 'r']
"lowest" → ['l', 'o', 'w', 'e', 's', 't']
STEP 1: Find most frequent pair → ('l', 'o') appears most
Merge: "lo" becomes a token
"lower" → ['lo', 'w', 'e', 'r']
STEP 2: Find most frequent pair → ('lo', 'w')
Merge: "low" becomes a token
"lower" → ['low', 'e', 'r']
STEP 3: ('e', 'r') is frequent
Merge: "er" becomes a token
"lower" → ['low', 'er']
STEP 4: ('low', 'er') is frequent
Merge: "lower" becomes a token
"lower" → ['lower']
... Continue until vocabulary reaches target size (32K-128K)
Tokenization Algorithms Compared¶
| Algorithm | Used By | Key Difference |
|---|---|---|
| BPE | GPT-2/3/4/5, LLaMA, Mistral | Merge most frequent byte pairs bottom-up |
| WordPiece | BERT, DistilBERT | Like BPE but maximizes likelihood of training data |
| Unigram | T5, ALBERT (via SentencePiece) | Starts with large vocab, prunes least useful tokens |
| SentencePiece | LLaMA, Gemma, T5 | Language-agnostic, treats input as raw bytes (no pre-tokenization) |
Practical Impact¶
# ⚠️ Last tested: 2026-04
# Using OpenAI's tiktoken
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
# English is efficient:
tokens = enc.encode("Hello, how are you?")
print(len(tokens)) # 6 tokens
print(tokens) # [9906, 11, 1268, 527, 499, 30]
# Code is less efficient:
tokens = enc.encode("def calculate_fibonacci(n):")
print(len(tokens)) # 7 tokens
# Non-English is expensive:
tokens = enc.encode("नमस्ते, आप कैसे हैं?") # Hindi
print(len(tokens)) # 20+ tokens for same meaning!
# This means Hindi/Arabic/CJK users pay 2-4x more per API call
Vocabulary Sizes Across Models¶
| Model | Vocab Size | Tokenizer |
|---|---|---|
| GPT-2 | 50,257 | BPE (tiktoken) |
| GPT-4 / GPT-4o | 100,277 | BPE (tiktoken, cl100k_base) |
| GPT-5 | ~200,000 | BPE (tiktoken, o200k_base) |
| LLaMA 2 | 32,000 | SentencePiece (BPE) |
| LLaMA 3/4 | 128,256 | SentencePiece (BPE) |
| Gemini | 256,000 | SentencePiece |
| Claude | ~100,000 | BPE variant |
Trend: Vocab sizes are GROWING. Larger vocab = fewer tokens per text = faster inference but larger embedding table.
Special Tokens¶
Common special tokens across models:
<|begin_of_text|> → Start of sequence
<|end_of_text|> → End of sequence / stop generating
<|im_start|> → Start of a message (ChatML format)
<|im_end|> → End of a message
<|system|> → System prompt marker
These are NOT part of the text — they're control signals the model was trained on.
Different models use different special tokens (not compatible).
◆ Strengths vs Limitations¶
| ✅ Strengths | ❌ Limitations |
|---|---|
| Sub-word = handles any text (even made-up words) | Non-English languages get more tokens per word = inequity |
| Fixed vocabulary = predictable model size | Arithmetic is hard (numbers split unpredictably) |
| BPE is fast and deterministic | Character-level tasks (counting letters, reversal) fail |
| Tokenizers are model-specific and well-tested | Can't easily swap tokenizers between models |
◆ Quick Reference¶
TOKEN ESTIMATION:
1 token ≈ 4 characters (English)
1 token ≈ ¾ of a word (English)
100 tokens ≈ 75 words
1 page ≈ 300 tokens
COST IMPACT (example at $3/1M tokens):
1,000 word document ≈ 1,300 tokens ≈ $0.004
Full book (80K words) ≈ 100K tokens ≈ $0.30
TOOLS:
tiktoken (OpenAI): pip install tiktoken
tokenizers (HuggingFace): pip install tokenizers
Online: platform.openai.com/tokenizer
○ Gotchas & Common Mistakes¶
- ⚠️ "Why can't GPT count letters in strawberry?" — Because it sees
["straw", "berry"], not individual characters. It literally can't see the letters. - ⚠️ Token ≠ word: Never estimate costs by word count. Always use the tokenizer library.
- ⚠️ Multilingual cost surprise: Hindi/Arabic/Japanese text can be 2-4x more tokens than equivalent English.
- ⚠️ Context window is in tokens: "128K context" means 128K tokens, not characters or words. In English, that's roughly 96K words.
- ⚠️ Leading whitespace matters:
" hello"(with space) and"hello"are often different tokens.
○ Interview Angles¶
- Q: Why do LLMs use sub-word tokenization instead of word-level?
-
A: Word-level requires an impossibly large vocabulary (every word in every language) and can't handle misspellings, new words, or code. Sub-word splits rare words into common pieces ("unhappiness" → ["un", "happiness"]) while keeping frequent words whole. Fixed vocab size (~32K-128K), handles any input.
-
Q: Why is tokenization a source of bias?
- A: Languages with less representation in training data get worse tokenization — more tokens per word. This means non-English users spend more money, get slower responses, and use more of their context window for the same content. Larger vocabularies (LLaMA 3's 128K vs LLaMA 2's 32K) help mitigate this.
★ Code & Implementation¶
Token Cost Calculator (tiktoken)¶
# pip install tiktoken>=0.6
# ⚠️ Last tested: 2026-04 | Requires: tiktoken>=0.6
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
def token_cost_report(texts: dict[str, str], price_per_1m: float = 2.50) -> None:
"""Report token count and estimated cost for a dict of text samples."""
print(f"{'Label':<25} {'Tokens':>8} {'Cost ($)':>12}")
print("-" * 48)
for label, text in texts.items():
n = len(enc.encode(text))
cost = (n / 1_000_000) * price_per_1m
print(f"{label:<25} {n:>8,} {cost:>12.6f}")
samples = {
"English (100 words)": "The quick brown fox jumps over the lazy dog. " * 5,
"Code (Python func)": "def fibonacci(n):\n if n <= 1:\n return n\n return fibonacci(n-1) + fibonacci(n-2)\n" * 3,
"Hindi (same meaning)": "नमस्ते, आप कैसे हैं? मैं ठीक हूँ। " * 10, # 2-4x more tokens
"JSON (structured)": '{"name": "Alice", "age": 30, "city": "Tokyo"}\n' * 10,
}
token_cost_report(samples)
# English: ~75 tokens — Hindi: ~200+ tokens for equivalent content
Cross-Model Token Comparison¶
# ⚠️ Last tested: 2026-04 | Requires: tiktoken>=0.6, transformers>=4.40
import tiktoken
from transformers import AutoTokenizer
text = "The transformer architecture revolutionized natural language processing in 2017."
# OpenAI tokenizers
for model in ["gpt-4o", "gpt-3.5-turbo"]:
enc = tiktoken.encoding_for_model(model)
tokens = enc.encode(text)
print(f"OpenAI {model}: {len(tokens)} tokens → {tokens}")
# HuggingFace tokenizers
for hf_model in ["meta-llama/Llama-3.2-1B", "google/gemma-2-2b"]:
tok = AutoTokenizer.from_pretrained(hf_model)
tokens = tok.encode(text)
print(f"HF {hf_model.split('/')[-1]}: {len(tokens)} tokens")
# Different tokenizers → different counts for same text
# This is why you MUST use the correct tokenizer for each model
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | String processing, Compression algorithms (BPE originated in compression) |
| Leads to | Embeddings (tokens → vectors), Large Language Models (LLMs) |
| Compare with | Character encoding (ASCII/UTF-8), Word-level parsing |
| Cross-domain | Linguistic morphology, Data compression |
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Token count mismatch | Context window overflow despite short text | Different tokenizers produce different counts; code/non-English inflate tokens | Use exact model tokenizer (tiktoken), not character heuristics |
| Multilingual over-tokenization | Non-English text uses 2-5x more tokens | BPE trained primarily on English corpus | Multilingual tokenizers, language-specific models |
| Special token injection | User input contains control tokens that alter behavior | No input sanitization | Strip or escape special tokens from user input |
◆ Hands-On Exercises¶
Exercise 1: Token Economics Calculator¶
Goal: Build a tool that estimates API cost from text input Time: 20 minutes Steps: 1. Use tiktoken to count tokens for 10 sample texts (English, code, multilingual) 2. Calculate cost at GPT-4o pricing per input/output 3. Compare token counts across languages and content types Expected Output: Cost table showing 2-3x variance across languages
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 📄 Paper | Sennrich et al. "BPE for Neural Machine Translation" (2016) | The paper that introduced BPE to NLP |
| 🎥 Video | Andrej Karpathy — "Let's Build GPT Tokenizer" | Build BPE from scratch — best practical walkthrough |
| 🔧 Hands-on | HuggingFace Tokenizers Library | Fast, production-grade tokenizer implementations |
| 📘 Book | "Build a Large Language Model (From Scratch)" by Sebastian Raschka (2024), Ch 2 | Tokenizer implementation with BPE and SentencePiece |
★ Sources¶
- Sennrich et al., "Neural Machine Translation of Rare Words with Subword Units" (BPE, 2016)
- Kudo & Richardson, "SentencePiece: A simple and language independent subword tokenizer" (2018)
- OpenAI tiktoken — https://github.com/openai/tiktoken
- Hugging Face Tokenizers — https://huggingface.co/docs/tokenizers