Embedding Fine-Tuning¶
✨ Bit: General-purpose embeddings work well for general-purpose tasks. But if your RAG system struggles with domain-specific queries (legal, medical, code), fine-tuning embeddings on your data can improve retrieval by 10-30% — often more impactful than changing the LLM.
★ TL;DR¶
- What: Training or adapting embedding models on domain-specific data to improve retrieval quality for RAG and search systems
- Why: Off-the-shelf embeddings (OpenAI, Cohere) are trained on general web data. Domain-specific terminology, jargon, and relationships aren't well captured.
- Key point: Fine-tuning embeddings is often the highest-ROI improvement for RAG systems — 10-30% retrieval quality improvement with relatively small training sets (1K-10K examples).
★ Overview¶
Definition¶
Embedding fine-tuning adapts a pretrained embedding model to a specific domain or task by training on labeled pairs (query, relevant_document) or triplets (query, positive, negative) using contrastive learning objectives.
Scope¶
Covers: When to fine-tune vs use off-the-shelf, training data generation, contrastive learning, practical fine-tuning with Sentence Transformers, evaluation. For embedding fundamentals, see Embeddings. For retrieval evaluation, see Retrieval Evaluation.
Prerequisites¶
- Embeddings — vector representation fundamentals
- RAG — retrieval architecture
- Fine-Tuning — general fine-tuning concepts
★ Deep Dive¶
When to Fine-Tune Embeddings¶
DECISION TREE:
Is your retrieval quality good enough (Recall@5 > 0.8)?
├── YES → Don't fine-tune. Focus on LLM/prompt improvements.
└── NO → Is the problem domain-specific vocabulary?
    ├── YES → Fine-tune embeddings (highest ROI)
    └── NO → Check chunking, reranking, hybrid search first
        └── Still bad? → Fine-tune embeddings
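To ground the first branch with a number, compute Recall@k over a small labeled eval set. A minimal sketch, assuming retrieve(query, k) is your existing retriever returning document IDs (names here are illustrative placeholders, not a library API):
def recall_at_k(eval_set: list[dict], retrieve, k: int = 5) -> float:
    """eval_set items look like {"query": ..., "relevant_doc_id": ...};
    retrieve(query, k) returns the top-k document IDs, best first."""
    hits = sum(ex["relevant_doc_id"] in retrieve(ex["query"], k) for ex in eval_set)
    return hits / len(eval_set)
# if recall_at_k(eval_set, retrieve) > 0.8: skip fine-tuning, tune the LLM/prompt instead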
SIGNS YOU NEED EMBEDDING FINE-TUNING:
✗ Medical queries: "dyspnea" doesn't match "shortness of breath"
✗ Legal queries: "force majeure" doesn't find relevant contract clauses
✗ Code queries: "implement retry logic" doesn't find error handling code
✗ Internal jargon: Company-specific terms have no good embedding
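A quick way to confirm a vocabulary gap like the ones above is to check whether the base model already embeds domain synonyms close together. A minimal diagnostic sketch (using the same base model as the fine-tuning example below):
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
a, b = model.encode(["dyspnea", "shortness of breath"], normalize_embeddings=True)
print(f"cosine similarity: {a @ b:.3f}")
# A low score means the base model doesn't treat these as synonyms —
# the signature symptom that embedding fine-tuning addresses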
Training Data for Embedding Fine-Tuning¶
| Data Type | Format | How to Generate | Volume Needed |
|---|---|---|---|
| Positive pairs | (query, relevant_doc) | From search logs, user clicks, manual labeling | 1K-10K pairs |
| Triplets | (query, positive_doc, negative_doc) | Positive from logs + hard negatives from retrieval (see BM25 sketch below) | 5K-50K triplets |
| LLM-generated | Synthetic queries from documents | Use LLM to generate questions that each document answers | 1K-10K pairs |
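For the triplet row, hard negatives are commonly mined with BM25: documents that are lexically similar to the query but are not the labeled positive. A minimal sketch with the rank_bm25 package (the corpus and helper name are illustrative):
# pip install rank-bm25
from rank_bm25 import BM25Okapi

corpus = [
    "Hypertension is primarily caused by arterial stiffening...",
    "Blood pressure measurement requires a calibrated cuff...",
    "Stock market trends in 2024 were driven by rate expectations...",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def mine_hard_negative(query: str, positive: str) -> str:
    """Top BM25 hit that isn't the labeled positive: lexically close
    but not actually relevant — exactly what makes a negative 'hard'."""
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(range(len(corpus)), key=lambda i: -scores[i])
    return next(corpus[i] for i in ranked if corpus[i] != positive)

query, positive = "What causes high blood pressure?", corpus[0]
triplet = (query, positive, mine_hard_negative(query, positive))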
Contrastive Learning Objective¶
TRAINING OBJECTIVE: Pull matching pairs together, push non-matching apart
Query: "What causes high blood pressure?"
Positive doc: "Hypertension is caused by..." → PULL CLOSER
Negative doc: "Stock market trends in 2024..." → PUSH APART
Hard negative: "Blood pressure measurement..." → PUSH APART (harder!)
Loss function: InfoNCE / Multiple Negatives Ranking Loss
L = -log( exp(sim(q, d⁺)/τ) / Σ_i exp(sim(q, d_i)/τ) )
Where:
q   = query embedding
d⁺  = positive document embedding
d_i = the i-th document in the batch (the sum runs over all in-batch positives and negatives)
τ   = temperature (default 0.05)
sim = cosine similarity
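For intuition, here is a minimal PyTorch sketch of this in-batch objective — the same idea MultipleNegativesRankingLoss implements (the function name and shapes are illustrative):
import torch
import torch.nn.functional as F

def info_nce_loss(query_embs: torch.Tensor, doc_embs: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """Row i of doc_embs is the positive for query i; every other
    in-batch document acts as a negative."""
    q = F.normalize(query_embs, dim=-1)
    d = F.normalize(doc_embs, dim=-1)
    sim = q @ d.T / tau                     # (batch, batch) cosine similarities, scaled by temperature
    labels = torch.arange(sim.size(0))      # the diagonal holds the matching pairs
    return F.cross_entropy(sim, labels)     # -log softmax of the positive = InfoNCE

loss = info_nce_loss(torch.randn(8, 384), torch.randn(8, 384))  # toy batch of 8 pairs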
★ Code & Implementation¶
Fine-Tune Embeddings with Sentence Transformers¶
# pip install sentence-transformers>=3.0
# ⚠️ Last tested: 2026-04 | Requires: sentence-transformers>=3.0
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
# 1. Load base model
model = SentenceTransformer("BAAI/bge-small-en-v1.5")
# 2. Prepare training data (query, positive_doc pairs)
train_examples = [
InputExample(texts=["What causes hypertension?",
"Hypertension is primarily caused by arterial stiffening..."]),
InputExample(texts=["Treatment for type 2 diabetes",
"First-line treatment for T2DM includes metformin..."]),
InputExample(texts=["Side effects of statins",
"Common statin side effects include myalgia..."]),
# ... add 1K-10K pairs for production quality
]
# 3. Create dataloader
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
# 4. Use Multiple Negatives Ranking Loss (best for retrieval)
train_loss = losses.MultipleNegativesRankingLoss(model=model)
# 5. Fine-tune
model.fit(
train_objectives=[(train_dataloader, train_loss)],
epochs=3,
warmup_steps=100,
output_path="./models/medical-embeddings",
show_progress_bar=True,
)
# 6. Use fine-tuned model
model = SentenceTransformer("./models/medical-embeddings")
query_emb = model.encode("shortness of breath treatment")
doc_emb = model.encode("Dyspnea management includes bronchodilators...")
similarity = query_emb @ doc_emb / (
(query_emb @ query_emb) ** 0.5 * (doc_emb @ doc_emb) ** 0.5
)
print(f"Similarity: {similarity:.3f}")
# Expected: Higher similarity than base model (domain terms aligned)
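To verify the fine-tune actually helped (step 5 in the checklist below), score both models on a held-out set with the library's built-in InformationRetrievalEvaluator. A minimal sketch — the two-document eval set is purely illustrative; real evaluation needs hundreds of held-out queries:
# ⚠️ Last tested: 2026-04 | Requires: sentence-transformers>=3.0
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

queries = {"q1": "shortness of breath treatment"}
corpus = {
    "d1": "Dyspnea management includes bronchodilators...",
    "d2": "Stock market trends in 2024...",
}
relevant_docs = {"q1": {"d1"}}  # ground truth: which corpus docs answer each query

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="medical")

for path in ["BAAI/bge-small-en-v1.5", "./models/medical-embeddings"]:
    metrics = evaluator(SentenceTransformer(path))  # returns a dict of IR metrics in v3
    print(path, {k: v for k, v in metrics.items() if "recall@5" in k or "mrr" in k})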
Generate Training Data with LLM¶
# pip install openai>=1.0
# ⚠️ Last tested: 2026-04 | Requires: openai>=1.0
from openai import OpenAI
import json
client = OpenAI()
def generate_training_pairs(documents: list[str], n_queries_per_doc: int = 3) -> list[dict]:
"""Generate synthetic (query, document) training pairs using an LLM."""
pairs = []
for doc in documents:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "user",
"content": f"""Generate {n_queries_per_doc} diverse search queries that this document would answer.
Return JSON array of strings.
Document: {doc[:2000]}"""
}],
response_format={"type": "json_object"},
temperature=0.7,
)
queries = json.loads(response.choices[0].message.content).get("queries", [])
for query in queries:
pairs.append({"query": query, "document": doc})
return pairs
# Usage
docs = [
"Metformin is the first-line treatment for type 2 diabetes mellitus...",
"Hypertension management begins with lifestyle modifications...",
]
training_data = generate_training_pairs(docs)
print(f"Generated {len(training_data)} training pairs")
# Expected: 6 query-document pairs ready for embedding fine-tuning
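The generated pairs drop straight into the fine-tuning recipe above:
from sentence_transformers import InputExample

train_examples = [InputExample(texts=[p["query"], p["document"]]) for p in training_data]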
◆ Quick Reference¶
EMBEDDING FINE-TUNING CHECKLIST:
1. Baseline: Measure retrieval quality with off-the-shelf model
2. Data: Collect/generate 1K-10K (query, relevant_doc) pairs
3. Model: Start with a strong base (bge-small, e5-small, gte-base)
4. Train: 3-5 epochs, MNR loss, batch size 32-64
5. Evaluate: Compare Recall@5 and MRR before vs after
6. Deploy: Replace embedding model in your RAG pipeline
7. Monitor: Track retrieval quality in production
EXPECTED IMPROVEMENTS:
General domain: 5-10% improvement in Recall@5
Specialized domain: 10-30% improvement in Recall@5
With hard negatives: Additional 5-10% boost
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Catastrophic forgetting | Fine-tuned model worse on general queries | Overfitting to domain data, losing general knowledge | Use a lower learning rate, fewer epochs, mix in general data (see sketch below) |
| Low-quality training data | No improvement after fine-tuning | Training pairs are noisy, irrelevant, or too easy | Clean data, add hard negatives, use LLM-generated queries |
| Embedding dimension mismatch | Can't deploy — vector DB expects different dimensions | Changed model architecture during fine-tuning | Use same model family, or rebuild vector index |
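For the first row, the "mix in general data" mitigation is simple to implement: keep a slice of general-domain pairs in every training run. A minimal sketch — the roughly 4:1 domain-to-general ratio is an assumption to tune against your eval set, not a published recipe:
from sentence_transformers import InputExample

domain_examples = [
    InputExample(texts=["Treatment for type 2 diabetes",
                        "First-line treatment for T2DM includes metformin..."]),
]
general_examples = [  # e.g. sampled from an open general-domain QA dataset
    InputExample(texts=["capital of France", "Paris is the capital of France."]),
]
# Assumption: ~20% general data; adjust the ratio based on eval results
train_examples = domain_examples + general_examples[: max(1, len(domain_examples) // 4)]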
○ Interview Angles¶
- Q: How would you improve retrieval quality in a RAG system?
- A: I'd follow a priority ladder. First, measure baseline retrieval quality (Precision@5, Recall@5) to quantify the gap. Second, check chunking — are chunks the right size (200-500 tokens) with enough context? Third, try hybrid search (semantic + keyword with BM25). Fourth, add a cross-encoder reranker on top-20 results. If the domain is specialized (medical, legal), I'd fine-tune the embedding model on 5K-10K domain-specific (query, document) pairs using contrastive learning — this typically gives 10-30% improvement on domain queries. I'd evaluate each change independently to measure its contribution.
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | Embeddings, Fine-Tuning, RAG |
| Leads to | Domain-specific RAG, improved retrieval quality, Retrieval Evaluation |
| Compare with | Off-the-shelf embeddings, reranking (no fine-tuning needed) |
| Cross-domain | Information retrieval, search engineering, NLP |
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 🔧 Hands-on | Sentence Transformers Fine-Tuning Guide | Official guide for embedding fine-tuning |
| 📄 Paper | Xiao et al. "C-Pack: Packaged Resources for General Chinese Embeddings" (BGE, 2023) | How BAAI/bge models are trained — informative for fine-tuning strategy |
| 📘 Book | "AI Engineering" by Chip Huyen (2025), Ch 3 | Embedding selection and optimization in RAG |
| 🔧 Hands-on | MTEB Leaderboard | Compare embedding model quality before choosing a base |
◆ Hands-On Exercises¶
Exercise 1: Fine-Tune an Embedding Model on Your Domain¶
Goal: Improve retrieval quality by fine-tuning embeddings on domain data
Time: 45 minutes
Steps:
1. Create 200 positive pairs (query, relevant document) from your domain
2. Generate hard negatives using BM25 retrieval
3. Fine-tune a sentence-transformers model with MultipleNegativesRankingLoss
4. Compare retrieval metrics before and after fine-tuning
Expected Output: MRR improvement table showing fine-tuned model outperforms base
★ Sources¶
- Sentence Transformers Documentation — https://www.sbert.net/
- MTEB Embedding Benchmark — https://huggingface.co/spaces/mteb/leaderboard
- Reimers & Gurevych "Sentence-BERT" (2019)
- Embeddings
- RAG