Skip to content

Tokenization

Bit: LLMs don't see words. They see token IDs. "Hello" = [15496]. This is why they can't count the letters in "strawberry" — they see something like [" straw", "berry"], not individual characters.


★ TL;DR

  • What: The process of breaking text into sub-word units (tokens) that LLMs actually process
  • Why: Models can't process raw text. Tokenization determines what the model "sees" — and affects cost, multilingual performance, and model behavior
  • Key point: ~4 English characters ≈ 1 token. Non-English text is often 2-3x more tokens per word (= more expensive, slower).

★ Overview

Definition

Tokenization is the process of converting raw text into a sequence of discrete units (tokens) that a language model can process. Modern tokenizers use sub-word algorithms that split text into pieces between individual characters and full words, creating a vocabulary that balances expressiveness with efficiency.

Scope

Covers tokenization algorithms (BPE, WordPiece, SentencePiece), practical implications, and model-specific tokenizers. For what happens after tokenization (embeddings), see Embeddings.

Significance

  • Directly affects API cost (pricing is per-token)
  • Determines multilingual capability (poor tokenizers = expensive for non-English)
  • Explains many LLM "failures" (can't count letters, bad at math = tokenization artifacts)
  • Different models use different tokenizers — tokens aren't transferable

Prerequisites


★ Deep Dive

Why Not Just Use Words or Characters?

WORD-LEVEL:
  Vocabulary: Every word in every language = unlimited
  Problem: "unhappiness" = unknown word? Need infinite vocabulary.

CHARACTER-LEVEL:
  Vocabulary: ~100 characters (a-z, A-Z, 0-9, punctuation)
  Problem: "transformer" = 11 tokens. Sequences become way too long.

SUB-WORD (what we actually use):
  Vocabulary: 32K - 128K tokens
  "unhappiness" → ["un", "happiness"]  ← Common words stay whole
  "transformer" → ["transform", "er"]   ← Rare words split intelligently
  Balance: Compact vocabulary + handles any text

How BPE Works (Most Common Algorithm)

Byte Pair Encoding (BPE) — used by GPT, LLaMA, Mistral:

START: Split everything into characters
  "lower" → ['l', 'o', 'w', 'e', 'r']
  "lowest" → ['l', 'o', 'w', 'e', 's', 't']

STEP 1: Find most frequent pair → ('l', 'o') appears most
  Merge: "lo" becomes a token
  "lower" → ['lo', 'w', 'e', 'r']

STEP 2: Find most frequent pair → ('lo', 'w')
  Merge: "low" becomes a token
  "lower" → ['low', 'e', 'r']

STEP 3: ('e', 'r') is frequent
  Merge: "er" becomes a token
  "lower" → ['low', 'er']

STEP 4: ('low', 'er') is frequent
  Merge: "lower" becomes a token
  "lower" → ['lower']

... Continue until vocabulary reaches target size (32K-128K)

Tokenization Algorithms Compared

Algorithm Used By Key Difference
BPE GPT-2/3/4/5, LLaMA, Mistral Merge most frequent byte pairs bottom-up
WordPiece BERT, DistilBERT Like BPE but maximizes likelihood of training data
Unigram T5, ALBERT (via SentencePiece) Starts with large vocab, prunes least useful tokens
SentencePiece LLaMA, Gemma, T5 Language-agnostic, treats input as raw bytes (no pre-tokenization)

Practical Impact

# ⚠️ Last tested: 2026-04
# Using OpenAI's tiktoken
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")

# English is efficient:
tokens = enc.encode("Hello, how are you?")
print(len(tokens))         # 6 tokens
print(tokens)              # [9906, 11, 1268, 527, 499, 30]

# Code is less efficient:
tokens = enc.encode("def calculate_fibonacci(n):")
print(len(tokens))         # 7 tokens

# Non-English is expensive:
tokens = enc.encode("नमस्ते, आप कैसे हैं?")  # Hindi
print(len(tokens))         # 20+ tokens for same meaning!

# This means Hindi/Arabic/CJK users pay 2-4x more per API call

Vocabulary Sizes Across Models

Model Vocab Size Tokenizer
GPT-2 50,257 BPE (tiktoken)
GPT-4 / GPT-4o 100,277 BPE (tiktoken, cl100k_base)
GPT-5 ~200,000 BPE (tiktoken, o200k_base)
LLaMA 2 32,000 SentencePiece (BPE)
LLaMA 3/4 128,256 SentencePiece (BPE)
Gemini 256,000 SentencePiece
Claude ~100,000 BPE variant

Trend: Vocab sizes are GROWING. Larger vocab = fewer tokens per text = faster inference but larger embedding table.

Special Tokens

Common special tokens across models:

<|begin_of_text|>   → Start of sequence
<|end_of_text|>     → End of sequence / stop generating
<|im_start|>        → Start of a message (ChatML format)
<|im_end|>          → End of a message
<|system|>          → System prompt marker

These are NOT part of the text — they're control signals the model was trained on.
Different models use different special tokens (not compatible).

◆ Strengths vs Limitations

✅ Strengths ❌ Limitations
Sub-word = handles any text (even made-up words) Non-English languages get more tokens per word = inequity
Fixed vocabulary = predictable model size Arithmetic is hard (numbers split unpredictably)
BPE is fast and deterministic Character-level tasks (counting letters, reversal) fail
Tokenizers are model-specific and well-tested Can't easily swap tokenizers between models

◆ Quick Reference

TOKEN ESTIMATION:
  1 token ≈ 4 characters (English)
  1 token ≈ ¾ of a word (English)
  100 tokens ≈ 75 words
  1 page ≈ 300 tokens

COST IMPACT (example at $3/1M tokens):
  1,000 word document ≈ 1,300 tokens ≈ $0.004
  Full book (80K words) ≈ 100K tokens ≈ $0.30

TOOLS:
  tiktoken (OpenAI): pip install tiktoken
  tokenizers (HuggingFace): pip install tokenizers
  Online: platform.openai.com/tokenizer

○ Gotchas & Common Mistakes

  • ⚠️ "Why can't GPT count letters in strawberry?" — Because it sees ["straw", "berry"], not individual characters. It literally can't see the letters.
  • ⚠️ Token ≠ word: Never estimate costs by word count. Always use the tokenizer library.
  • ⚠️ Multilingual cost surprise: Hindi/Arabic/Japanese text can be 2-4x more tokens than equivalent English.
  • ⚠️ Context window is in tokens: "128K context" means 128K tokens, not characters or words. In English, that's roughly 96K words.
  • ⚠️ Leading whitespace matters: " hello" (with space) and "hello" are often different tokens.

○ Interview Angles

  • Q: Why do LLMs use sub-word tokenization instead of word-level?
  • A: Word-level requires an impossibly large vocabulary (every word in every language) and can't handle misspellings, new words, or code. Sub-word splits rare words into common pieces ("unhappiness" → ["un", "happiness"]) while keeping frequent words whole. Fixed vocab size (~32K-128K), handles any input.

  • Q: Why is tokenization a source of bias?

  • A: Languages with less representation in training data get worse tokenization — more tokens per word. This means non-English users spend more money, get slower responses, and use more of their context window for the same content. Larger vocabularies (LLaMA 3's 128K vs LLaMA 2's 32K) help mitigate this.

★ Code & Implementation

Token Cost Calculator (tiktoken)

# pip install tiktoken>=0.6
# ⚠️ Last tested: 2026-04 | Requires: tiktoken>=0.6
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

def token_cost_report(texts: dict[str, str], price_per_1m: float = 2.50) -> None:
    """Report token count and estimated cost for a dict of text samples."""
    print(f"{'Label':<25} {'Tokens':>8} {'Cost ($)':>12}")
    print("-" * 48)
    for label, text in texts.items():
        n = len(enc.encode(text))
        cost = (n / 1_000_000) * price_per_1m
        print(f"{label:<25} {n:>8,} {cost:>12.6f}")

samples = {
    "English (100 words)":  "The quick brown fox jumps over the lazy dog. " * 5,
    "Code (Python func)":   "def fibonacci(n):\n    if n <= 1:\n        return n\n    return fibonacci(n-1) + fibonacci(n-2)\n" * 3,
    "Hindi (same meaning)":  "नमस्ते, आप कैसे हैं? मैं ठीक हूँ। " * 10,   # 2-4x more tokens
    "JSON (structured)":    '{"name": "Alice", "age": 30, "city": "Tokyo"}\n' * 10,
}
token_cost_report(samples)
# English: ~75 tokens  — Hindi: ~200+ tokens for equivalent content

Cross-Model Token Comparison

# ⚠️ Last tested: 2026-04 | Requires: tiktoken>=0.6, transformers>=4.40
import tiktoken
from transformers import AutoTokenizer

text = "The transformer architecture revolutionized natural language processing in 2017."

# OpenAI tokenizers
for model in ["gpt-4o", "gpt-3.5-turbo"]:
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    print(f"OpenAI {model}: {len(tokens)} tokens → {tokens}")

# HuggingFace tokenizers
for hf_model in ["meta-llama/Llama-3.2-1B", "google/gemma-2-2b"]:
    tok = AutoTokenizer.from_pretrained(hf_model)
    tokens = tok.encode(text)
    print(f"HF {hf_model.split('/')[-1]}: {len(tokens)} tokens")
# Different tokenizers → different counts for same text
# This is why you MUST use the correct tokenizer for each model

★ Connections

Relationship Topics
Builds on String processing, Compression algorithms (BPE originated in compression)
Leads to Embeddings (tokens → vectors), Large Language Models (LLMs)
Compare with Character encoding (ASCII/UTF-8), Word-level parsing
Cross-domain Linguistic morphology, Data compression

◆ Production Failure Modes

Failure Symptoms Root Cause Mitigation
Token count mismatch Context window overflow despite short text Different tokenizers produce different counts; code/non-English inflate tokens Use exact model tokenizer (tiktoken), not character heuristics
Multilingual over-tokenization Non-English text uses 2-5x more tokens BPE trained primarily on English corpus Multilingual tokenizers, language-specific models
Special token injection User input contains control tokens that alter behavior No input sanitization Strip or escape special tokens from user input

◆ Hands-On Exercises

Exercise 1: Token Economics Calculator

Goal: Build a tool that estimates API cost from text input Time: 20 minutes Steps: 1. Use tiktoken to count tokens for 10 sample texts (English, code, multilingual) 2. Calculate cost at GPT-4o pricing per input/output 3. Compare token counts across languages and content types Expected Output: Cost table showing 2-3x variance across languages


Type Resource Why
📄 Paper Sennrich et al. "BPE for Neural Machine Translation" (2016) The paper that introduced BPE to NLP
🎥 Video Andrej Karpathy — "Let's Build GPT Tokenizer" Build BPE from scratch — best practical walkthrough
🔧 Hands-on HuggingFace Tokenizers Library Fast, production-grade tokenizer implementations
📘 Book "Build a Large Language Model (From Scratch)" by Sebastian Raschka (2024), Ch 2 Tokenizer implementation with BPE and SentencePiece

★ Sources

  • Sennrich et al., "Neural Machine Translation of Rare Words with Subword Units" (BPE, 2016)
  • Kudo & Richardson, "SentencePiece: A simple and language independent subword tokenizer" (2018)
  • OpenAI tiktoken — https://github.com/openai/tiktoken
  • Hugging Face Tokenizers — https://huggingface.co/docs/tokenizers