✨ Bit: Embeddings are how machines "understand" meaning — by turning everything (words, images, code) into lists of numbers where similar things are close together. "King - Man + Woman = Queen" is the most famous demonstration that it works.
An embedding is a mapping from high-dimensional, sparse data (like words, sentences, or images) to a dense, lower-dimensional vector space where semantic relationships are preserved as geometric relationships (distance, direction).
| Year  | Era                   | Models                        | Key idea                                                         |
|-------|-----------------------|-------------------------------|------------------------------------------------------------------|
| 2013  | Word2Vec              | Word2Vec                      | Predict context words (skip-gram) or target from context (CBOW)  |
| 2014  | GloVe                 | GloVe                         | Global word co-occurrence statistics                              |
| 2018  | Contextual            | ELMo, BERT                    | Same word gets different embeddings based on context              |
| 2020+ | Sentence Transformers | SBERT                         | Fine-tuned BERT for sentence-level similarity                      |
| 2024+ | Instruction-tuned     | text-embedding-3, e5-instruct | Follow instructions like "Represent this for retrieval"           |
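The famous analogy from the opening can be checked directly with pretrained Word2Vec vectors. This is a sketch only: the gensim package and its downloadable word2vec-google-news-300 vectors (roughly a 1.6 GB download) are assumptions, not something prescribed above.

```python
# Sketch: checking the king/queen analogy with pretrained Word2Vec vectors.
# Assumes the gensim package; word2vec-google-news-300 is a large download.
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")  # pretrained KeyedVectors

# "king" - "man" + "woman" ≈ ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' comes back as the closest word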
Key breakthrough: Contextual embeddings. "Bank" in "river bank" vs "bank account" gets DIFFERENT vectors. Static, pre-contextual embeddings (Word2Vec, GloVe) gave both uses the same vector.
```python
# ⚠️ Last tested: 2026-04
import numpy as np

def cosine_similarity(a, b):
    """Most common metric for text embeddings."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Example (toy 3-dimensional vectors):
embed_cat = [0.21, -0.55, 0.89]
embed_dog = [0.23, -0.51, 0.85]
embed_car = [-0.67, 0.33, -0.12]

cosine_similarity(embed_cat, embed_dog)  # → ≈ 0.999 (very similar!)
cosine_similarity(embed_cat, embed_car)  # → ≈ -0.53 (very different)
```
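The contextual breakthrough can be seen directly. The sketch below assumes the Hugging Face transformers and torch packages and the public bert-base-uncased checkpoint (none of which are prescribed above): it pulls out the contextual vector for "bank" in two different sentences and compares them with cosine similarity.

```python
# Sketch: same surface word, different contexts, different contextual vectors.
# Assumes the `transformers` and `torch` packages and bert-base-uncased.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Return BERT's contextual vector for the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

v_river = bank_vector("He sat down on the river bank.")
v_money = bank_vector("She opened a new bank account.")
sim = torch.cosine_similarity(v_river, v_money, dim=0).item()
print(f"cosine similarity: {sim:.2f}")  # noticeably below 1.0; a static model would give exactly 1.0
```

With Word2Vec or GloVe, "bank" would map to one fixed row of the embedding matrix, so the two vectors would be identical by construction.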
⚠️ Embedding model for index ≠ query model = disaster: ALWAYS use the same model (and version) for embedding documents and queries; vectors from different models live in incompatible spaces.
⚠️ Long text ≠ good embedding: Most models have a max input (~8K tokens). Longer text gets truncated, losing info. Chunk first (see the sketch after this list).
⚠️ Dimensions aren't free: 3072-dim vectors cost 2x storage/compute vs 1536-dim. Use the smallest that gives acceptable quality.
⚠️ Cosine similarity isn't everything: Two documents about different aspects of the same topic might have high similarity but not answer the same question. Task-specific fine-tuning helps.
⚠️ Don't ignore the MTEB leaderboard: The Massive Text Embedding Benchmark ranks models. Check it before choosing.
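A minimal way to chunk before embedding, as referenced above. This is a sketch only: the function name and the word-based window and overlap sizes are illustrative, and production pipelines usually split on tokens and respect sentence or section boundaries.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into overlapping word windows so each chunk fits the model's input limit."""
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), step)]

# Embed each chunk separately instead of feeding (and silently truncating) the whole document.
chunks = chunk_text("some very long document " * 1000)
```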
Q: What are embeddings and why do they matter for GenAI?
A: Embeddings map data to dense vectors where semantic similarity becomes geometric distance. They're the foundation of RAG (find relevant documents), semantic search (find by meaning), and even the first layer of every LLM. Without embeddings, modern AI can't represent or compare meaning.
Q: What's the difference between word embeddings and sentence embeddings?
A: Word embeddings (Word2Vec, GloVe) encode individual words — "bank" always gets the same vector. Sentence embeddings (SBERT, text-embedding-3) encode entire sentences with context — "river bank" and "bank robbery" get very different vectors. Modern systems use sentence/paragraph embeddings.
Exercise 1: Compare Embedding Models on Your Domain
Goal: Benchmark 3 embedding models on domain-specific retrieval
Time: 30 minutes
Steps:
1. Prepare 20 query-document pairs from your domain
2. Embed with OpenAI text-embedding-3-small, Cohere embed-v4, and sentence-transformers
3. Compute cosine similarity for each pair
4. Rank models by retrieval accuracy
Expected Output: Comparison table with Precision@5 per model
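A possible starting point for the open-source leg of this exercise. The query-document pairs and the all-MiniLM-L6-v2 model name are placeholders, the OpenAI and Cohere models would be called through their own SDKs in the same loop, and the metric below is a hit rate rather than strict Precision@5 (each query has exactly one relevant document here).

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder query-document pairs; replace with 20 pairs from your domain.
pairs = [
    ("How do I reset my password?", "To reset your password, open Settings and ..."),
    ("What is the refund window?", "Refunds are accepted within 30 days of purchase ..."),
]
queries = [q for q, _ in pairs]
docs = [d for _, d in pairs]

model = SentenceTransformer("all-MiniLM-L6-v2")  # swap in other models to compare
q_emb = model.encode(queries, normalize_embeddings=True)
d_emb = model.encode(docs, normalize_embeddings=True)
sims = util.cos_sim(q_emb, d_emb)  # (n_queries, n_docs) cosine-similarity matrix

# Hit rate@k: fraction of queries whose true document lands in the top-k results.
k = min(5, len(docs))
hits = sum(int(i in sims[i].topk(k).indices) for i in range(len(pairs)))
print(f"hit rate@{k}: {hits / len(pairs):.2f}")
```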