
Embeddings

Bit: Embeddings are how machines "understand" meaning — by turning everything (words, images, code) into lists of numbers where similar things are close together. "King - Man + Woman = Queen" is the most famous proof it works.


★ TL;DR

  • What: Dense vector representations that encode the meaning/semantics of data (text, images, audio) into fixed-size arrays of numbers
  • Why: THE bridge between human-readable content and machine computation. Without embeddings, there's no RAG, no semantic search, no modern AI
  • Key point: Similar meanings → nearby vectors. "Dog" and "puppy" are close. "Dog" and "cryptocurrency" are far apart.

★ Overview

Definition

An embedding is a mapping from high-dimensional, sparse data (like words, sentences, or images) to a dense, lower-dimensional vector space where semantic relationships are preserved as geometric relationships (distance, direction).

Scope

Covers: What embeddings are, how they work, types, models, and practical usage. For how embeddings power RAG, see Retrieval-Augmented Generation (RAG). For databases that store them, see Vector Databases.

Significance

  • Foundation of RAG (embedding queries + documents for retrieval)
  • Foundation of semantic search (replace keyword matching with meaning matching)
  • Used in recommendation systems, anomaly detection, clustering
  • Every LLM internally uses embeddings as the first processing step

Prerequisites

  • Linear Algebra for AI (vectors, dot products, distance metrics)
  • Basic familiarity with how neural networks represent inputs

★ Deep Dive

The Core Idea

TRADITIONAL REPRESENTATION (sparse, no meaning):
  "cat"  → [0, 0, 0, 1, 0, 0, 0, ..., 0]  (one-hot, 50K+ dimensions)
  "dog"  → [0, 0, 1, 0, 0, 0, 0, ..., 0]
  Problem: "cat" and "dog" are equidistant from each other AND from "quantum"

EMBEDDING REPRESENTATION (dense, captures meaning):
  "cat"  → [0.21, -0.55, 0.89, 0.12, ..., 0.45]  (768-3072 dimensions)
  "dog"  → [0.23, -0.51, 0.85, 0.15, ..., 0.43]  ← CLOSE to cat!
  "quantum" → [-0.67, 0.33, -0.12, 0.91, ..., -0.28]  ← FAR from both

How Embeddings Are Created

┌──────────────────────────────────────────────────────────┐
│                    EMBEDDING PIPELINE                     │
│                                                          │
│  Text: "Transformers revolutionized NLP"                 │
│                     │                                    │
│                     ▼                                    │
│  ┌──────────────────────────────────────┐                │
│  │           EMBEDDING MODEL            │                │
│  │  (trained on massive text pairs)     │                │
│  │                                      │                │
│  │  "This sentence" ↔ "That sentence"   │                │
│  │  Similar?   → vectors close          │                │
│  │  Different? → vectors far            │                │
│  └──────────────────┬───────────────────┘                │
│                     ▼                                    │
│  Vector: [0.12, -0.45, 0.89, ..., 0.33]                  │
│          └───── 768-3072 numbers ─────┘                  │
└──────────────────────────────────────────────────────────┘
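
Under the hood, most modern embedding models are trained with a contrastive objective over text pairs: the matching pair should score higher than every other text in the batch. The toy sketch below (PyTorch, with random tensors standing in for encoder outputs) shows the common in-batch-negatives version of that idea; it illustrates the training signal, not any particular model's actual recipe.

# ⚠️ Toy sketch of contrastive training with in-batch negatives (illustrative only)
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_vecs, doc_vecs, temperature=0.05):
    """query_vecs[i] and doc_vecs[i] are embeddings of a matching (query, document) pair."""
    q = F.normalize(query_vecs, dim=-1)
    d = F.normalize(doc_vecs, dim=-1)
    logits = q @ d.T / temperature           # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0))         # the matching document sits on the diagonal
    return F.cross_entropy(logits, labels)   # pull pairs together, push everything else apart

# Fake batch of 4 pairs with 8-dim "embeddings" (a real encoder would produce these)
loss = in_batch_contrastive_loss(torch.randn(4, 8), torch.randn(4, 8))
print(loss.item())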

Types of Embeddings

Type | What It Embeds | Dimensions | Use Case
Word embeddings | Individual words | 100-300 | Legacy (Word2Vec, GloVe)
Sentence/Document | Entire text chunks | 768-3072 | RAG, semantic search
Code embeddings | Source code | 768-1536 | Code search, deduplication
Image embeddings | Images | 512-2048 | Image search, CLIP
Multimodal | Text AND images in the same space | 512-1024 | Cross-modal search (CLIP)
Token embeddings | Individual tokens (inside LLMs) | Model-dependent | Internal LLM processing

Evolution of Text Embeddings

Era | Method | Key Models | How It Works
2013 | Word2Vec | Word2Vec | Predict context words (skip-gram) or the target word from its context (CBOW)
2014 | GloVe | GloVe | Global word co-occurrence statistics
2018 | Contextual | ELMo, BERT | Same word gets different embeddings depending on context
2020+ | Sentence transformers | SBERT | BERT fine-tuned for sentence-level similarity
2024+ | Instruction-tuned | text-embedding-3, e5-instruct | Follow instructions like "Represent this for retrieval"

Key breakthrough: Contextual embeddings. "Bank" in "river bank" vs "bank account" gets DIFFERENT vectors. Pre-contextual embeddings gave it the same vector.
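
To make the "bank" example concrete, the sketch below pulls the contextual vector for the token "bank" out of a BERT encoder in two different sentences and compares them. It assumes the transformers library and the bert-base-uncased checkpoint are available; the exact similarity value will vary, but it lands well below 1.0, unlike a static word embedding.

# ⚠️ Sketch assuming `transformers` and bert-base-uncased are available
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Return the contextual hidden state of the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]            # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

v_river = bank_vector("He sat down on the river bank.")
v_money = bank_vector("She deposited the check at the bank.")
sim = torch.nn.functional.cosine_similarity(v_river, v_money, dim=0)
print(f"same word, different contexts: cosine = {sim.item():.2f}")  # noticeably below 1.0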

Current Best Embedding Models (March 2026)

Model | Provider | Dimensions | Best For
text-embedding-3-large | OpenAI | 3072 | Highest quality (API)
text-embedding-3-small | OpenAI | 1536 | Best budget option (API)
embed-v4 | Cohere | 1024 | Multilingual
bge-m3 | BAAI | 1024 | Best open-source, multilingual
e5-mistral-7b-instruct | Microsoft | 4096 | Highest-quality open-source
nomic-embed-text-v1.5 | Nomic | 768 | Good small open-source
jina-embeddings-v3 | Jina AI | 1024 | Task-specific dimensions

Similarity Measurement

# ⚠️ Last tested: 2026-04
import numpy as np

def cosine_similarity(a, b):
    """Most common metric for text embeddings"""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Example (toy 3-dimensional vectors):
embed_cat = [0.21, -0.55, 0.89]
embed_dog = [0.23, -0.51, 0.85]
embed_car = [-0.67, 0.33, -0.12]

print(cosine_similarity(embed_cat, embed_dog))  # ≈ 0.99 (very similar)
print(cosine_similarity(embed_cat, embed_car))  # ≈ -0.53 (dissimilar, pointing in different directions)
Metric | When to Use | Range
Cosine similarity | Normalized text embeddings (most common) | -1 to 1
Euclidean distance | When magnitude matters | 0 to ∞
Dot product | Already-normalized vectors, fast | -∞ to ∞
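
The table maps to a one-line difference in code. The sketch below (numpy only, made-up vectors) shows why cosine ignores magnitude while Euclidean distance and dot product do not, and why dot product equals cosine similarity once vectors are L2-normalized.

# ⚠️ Illustrative comparison of the three metrics (toy vectors)
import numpy as np

a = np.array([0.21, -0.55, 0.89])
b = 2 * a                                # same direction as a, twice the magnitude

cosine    = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))   # 1.0: direction only
euclidean = np.linalg.norm(a - b)                                    # > 0: magnitude matters
dot       = np.dot(a, b)                                             # grows with magnitude
print(cosine, euclidean, dot)

a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(np.dot(a_n, b_n))                  # after normalization, dot product == cosine similarity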

◆ Code & Implementation

# ⚠️ Last tested: 2026-04
# ═══ OPENAI EMBEDDINGS ═══
from openai import OpenAI
client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Transformers use self-attention"
)
vector = response.data[0].embedding  # List of 1536 floats
print(f"Dimensions: {len(vector)}")  # 1536

# ═══ OPEN-SOURCE (Sentence Transformers) ═══
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")
texts = ["How does RAG work?", "Retrieval augmented generation explained", "Best pizza in NYC"]
embeddings = model.encode(texts)

# Check similarity
from sklearn.metrics.pairwise import cosine_similarity
sims = cosine_similarity(embeddings)
# texts[0] vs texts[1]: ~0.92 (semantically similar!)
# texts[0] vs texts[2]: ~0.15 (unrelated)

# ═══ LOCAL WITH OLLAMA ═══
import ollama
response = ollama.embeddings(model="nomic-embed-text", prompt="Hello, world!")
vector = response["embedding"]  # 768 dimensions

◆ Strengths vs Limitations

✅ Strengths | ❌ Limitations
Capture semantic meaning (synonyms, concepts) | Fixed size: long documents are squeezed into the same dimensions as short ones
Enable similarity search at scale | Black box: hard to interpret what each dimension means
Work across languages (multilingual models) | Quality depends heavily on model choice
Cheap to compute and store | Domain-specific text may need fine-tuned embeddings
Foundation for RAG, search, recommendations | Can encode biases from training data

◆ Quick Reference

CHOOSING AN EMBEDDING MODEL:
  Best quality (API): text-embedding-3-large (OpenAI)
  Best budget (API): text-embedding-3-small (OpenAI)
  Best open-source: bge-m3 or e5-mistral-7b
  Multilingual: bge-m3 or Cohere embed-v4
  Local/offline: nomic-embed-text via Ollama

KEY NUMBERS:
  Dimensions: 768-3072 (typical)
  Storage: ~6KB per embedding (1536 dims × 4 bytes)
  1M documents × 1536 dims = ~6 GB storage

SIMILARITY THRESHOLDS (cosine, rough guide):
  > 0.9  = Very similar / near-duplicate
  0.7-0.9 = Related / relevant
  0.4-0.7 = Loosely related
  < 0.4  = Probably unrelated
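
A quick sanity check of the key numbers and thresholds above (float32 vectors, no index or metadata overhead; the label boundaries are this section's rough guide, not universal constants):

# ⚠️ Back-of-the-envelope only; real vector stores add index and metadata overhead
dims, docs, bytes_per_float = 1536, 1_000_000, 4
print(f"{dims * bytes_per_float / 1024:.1f} KB per vector")             # ~6.0 KB
print(f"{dims * bytes_per_float * docs / 1024**3:.1f} GB for 1M docs")  # ~5.7 GB

def label(cosine_sim):
    """Rough interpretation of a cosine score, per the guide above."""
    if cosine_sim > 0.9:
        return "very similar / near-duplicate"
    if cosine_sim > 0.7:
        return "related / relevant"
    if cosine_sim > 0.4:
        return "loosely related"
    return "probably unrelated"

print(label(0.83))  # related / relevant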

○ Gotchas & Common Mistakes

  • ⚠️ Embedding model for index ≠ query model = disaster: ALWAYS use the same model for embedding documents and queries.
  • ⚠️ Long text ≠ good embedding: Most models have a max input (~8K tokens). Longer text gets truncated, losing info. Chunk first (see the sketch after this list).
  • ⚠️ Dimensions aren't free: 3072-dim vectors cost 2x storage/compute vs 1536-dim. Use the smallest that gives acceptable quality.
  • ⚠️ Cosine similarity isn't everything: Two documents about different aspects of the same topic might have high similarity but not answer the same question. Task-specific fine-tuning helps.
  • ⚠️ Don't ignore the MTEB leaderboard: The Massive Text Embedding Benchmark ranks models. Check it before choosing.
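
The chunking sketch referenced above: a minimal word-based splitter with overlap. Real pipelines usually split on tokens or sentences and tune sizes per model; the numbers here are illustrative.

# ⚠️ Minimal sketch; chunk sizes and the word-based split are illustrative choices
def chunk_text(text, max_words=200, overlap=40):
    """Split text into overlapping chunks that stay well under the embedding model's input limit."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
        start += max_words - overlap
    return chunks

long_doc = "word " * 1000          # stand-in for a long document
pieces = chunk_text(long_doc)
print(len(pieces), "chunks")       # embed each chunk separately, with the SAME model used for queries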

○ Interview Angles

  • Q: What are embeddings and why do they matter for GenAI?
  • A: Embeddings map data to dense vectors where semantic similarity becomes geometric distance. They're the foundation of RAG (find relevant documents), semantic search (find by meaning), and even the first layer of every LLM. Without embeddings, modern AI can't represent or compare meaning.

  • Q: What's the difference between word embeddings and sentence embeddings?
  • A: Word embeddings (Word2Vec, GloVe) encode individual words: "bank" always gets the same vector. Sentence embeddings (SBERT, text-embedding-3) encode entire sentences with context, so "river bank" and "bank robbery" get very different vectors. Modern systems use sentence/paragraph embeddings.

★ Connections

Relationship | Topics
Builds on | Linear Algebra for AI, Neural network representations
Leads to | Retrieval-Augmented Generation (RAG), Vector Databases, Semantic search
Compare with | One-hot encoding (sparse), TF-IDF (sparse, frequency-based), Knowledge graphs
Cross-domain | Recommendation systems, Computer vision (image embeddings), Bioinformatics

◆ Production Failure Modes

Failure | Symptoms | Root Cause | Mitigation
Embedding drift | Similarity scores become unreliable over time | Model update changes the embedding space without re-indexing | Version-locked models, full re-index on model change
Domain mismatch | General embeddings perform poorly on specialized content | Model trained on general web data | Domain-specific fine-tuning, domain-adapted models
Dimensionality curse | High-dimensional embeddings slow retrieval without quality gain | Using 1536-dim when 256-dim suffices | Matryoshka embeddings, PCA reduction, benchmark at multiple dims (see sketch below)
Cosine similarity ceiling | All documents score 0.7-0.9, no discrimination | Embedding space too clustered for the domain | Calibrated thresholds, reranking layer, hybrid retrieval
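
For the "Dimensionality curse" row, the sketch below shows the mechanics of benchmarking at multiple dimensions: truncate each vector Matryoshka-style and re-normalize, then rerun the retrieval evaluation at each size. Truncation only preserves quality for models trained with Matryoshka representation learning (nomic-embed-text-v1.5 advertises this, for example); random vectors are used here purely to show the shape handling.

# ⚠️ Mechanics only; truncation keeps quality only for Matryoshka-trained models
import numpy as np

def truncate_and_renormalize(vectors, dims):
    cut = vectors[:, :dims]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

full = np.random.randn(1000, 1536).astype(np.float32)   # stand-in for real document embeddings
for d in (256, 512, 768, 1536):
    small = truncate_and_renormalize(full, d)
    print(d, small.shape)   # plug each variant into your retrieval eval and compare recall/latency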

◆ Hands-On Exercises

Exercise 1: Compare Embedding Models on Your Domain

Goal: Benchmark 3 embedding models on domain-specific retrieval
Time: 30 minutes
Steps:
  1. Prepare 20 query-document pairs from your domain
  2. Embed with OpenAI text-embedding-3-small, Cohere embed-v4, and sentence-transformers
  3. Compute cosine similarity for each pair
  4. Rank models by retrieval accuracy
Expected Output: Comparison table with Precision@5 per model
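
A starting skeleton for this exercise is sketched below. It uses two sentence-transformers checkpoints so it runs without API keys; swap in the API-based models and your real query-document pairs (the two pairs shown are placeholders). With one relevant document per query this measures a hit rate at k=5, a reasonable stand-in for the Precision@5 column.

# ⚠️ Skeleton only; the pairs list and the second model name are placeholders to replace
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

pairs = [
    ("query 1", "the document that should answer query 1"),
    ("query 2", "the document that should answer query 2"),
]  # ...extend to ~20 domain-specific pairs
queries = [q for q, _ in pairs]
docs = [d for _, d in pairs]

for name in ["BAAI/bge-m3", "sentence-transformers/all-MiniLM-L6-v2"]:
    model = SentenceTransformer(name)
    q_vecs, d_vecs = model.encode(queries), model.encode(docs)
    sims = cosine_similarity(q_vecs, d_vecs)            # (num_queries, num_docs)
    top5 = np.argsort(-sims, axis=1)[:, :5]             # indices of the 5 closest docs per query
    hits = sum(i in top5[i] for i in range(len(queries)))
    print(f"{name}: hit rate@5 = {hits / len(queries):.2f}")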


★ Further Reading

Type | Resource | Why
📄 Paper | Mikolov et al., the Word2Vec paper (2013) | Where it all started: skip-gram and CBOW
🎥 Video | Jay Alammar, "The Illustrated Word2vec" | Best visual explanation of word embeddings
🔧 Hands-on | OpenAI Embeddings Guide | Practical guide to using production embeddings
📘 Book | "Speech and Language Processing" by Jurafsky & Martin, Ch. 6 | Authoritative textbook treatment of vector semantics

★ Sources

  • Mikolov et al., "Efficient Estimation of Word Representations in Vector Space" (Word2Vec, 2013)
  • Reimers & Gurevych, "Sentence-BERT" (2019)
  • OpenAI Embeddings documentation — https://platform.openai.com/docs/guides/embeddings
  • MTEB Leaderboard — https://huggingface.co/spaces/mteb/leaderboard