Synthetic Data & Data Engineering for LLMs¶
✨ Bit: We've nearly read the entire internet. Estimates suggest high-quality public text data ran out around 2024. So what do we do? We use AI to generate MORE training data for AI. It sounds circular, but it works, if you're careful.
★ TL;DR¶
- What: Generating artificial training data using LLMs and curating/filtering real data for training
- Why: Training data is the MOAT. The quality, format, and diversity of data determine model quality more than architecture or compute.
- Key point: Models like Phi-3, Orca-2, and OpenHermes lean heavily on synthetic data generated by stronger teacher models such as GPT-4, and they're shockingly good. Data engineering IS model engineering.
★ Overview¶
Definition¶
- Synthetic data: AI-generated training examples (instruction-response pairs, reasoning chains, code) used to train or fine-tune other models
- Data engineering for LLMs: The full pipeline of collecting, filtering, deduplicating, formatting, and curating training data
Scope¶
Covers data generation methods, quality filtering, and format standards. For fine-tuning methods (LoRA, QLoRA), see Fine Tuning. For distillation as a training method, see Distillation And Compression.
Significance¶
- "Data is the new model architecture" — Microsoft Phi team
- Data generation can cost $0 with open models (no API fees)
- Understanding data formats is essential for fine-tuning
- Interview: "How would you create training data for a domain-specific LLM?"
★ Deep Dive¶
The Data-Centric AI Shift¶
MODEL-CENTRIC (old):
Fixed data + better architecture/training = better model
DATA-CENTRIC (now):
Fixed architecture + better data = better model
EVIDENCE:
Phi-3 (3.8B) ≈ GPT-3.5 quality ← HOW?
→ Trained on ~3.3T tokens of heavily filtered + synthetic data
→ Data quality > model size
Synthetic Data Generation Methods¶
| Method | How It Works | Example |
|---|---|---|
| Self-Instruct | Model generates instructions + answers from seeds | Stanford Alpaca |
| Evol-Instruct | Iteratively make instructions more complex | WizardLM |
| Distillation | Larger model generates high-quality outputs | Phi-3, Orca-2 |
| Persona-based | Assign personas for diverse responses | Persona Hub |
| Back-translation | Generate code → describe it, or vice versa | Code training |
| Rejection sampling | Generate many, keep only the best | DeepSeek-Math |
| Constitutional AI | Model critiques and revises its own outputs | Anthropic's Claude |
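The rejection-sampling row above can be sketched in a few lines. The `generate` and `score` callables are placeholders, not a real API: in practice `generate` is a high-temperature LLM sampling call and `score` is a verifier, unit test, or reward model.

```python
import random

def rejection_sample(prompt, generate, score, n=8, threshold=0.8):
    """Draw n candidate responses and keep only those the scorer accepts."""
    candidates = [generate(prompt) for _ in range(n)]
    return [c for c in candidates if score(prompt, c) >= threshold]

# Toy demo with stand-in generator/scorer; a real pipeline would sample an
# LLM at high temperature and score with a verifier or reward model.
random.seed(0)
gen = lambda p: f"{p} candidate {random.random():.2f}"
scorer = lambda p, c: float(c.split()[-1])   # "score" is the trailing number
kept = rejection_sample("2+2?", gen, scorer, n=10, threshold=0.5)
print(f"kept {len(kept)} of 10 candidates")
```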
SELF-INSTRUCT PIPELINE:
1. Start with 175 seed instructions (human-written)
2. Feed to LLM:
"Generate a new instruction similar to these examples:"
→ LLM generates: "Write a function to reverse a linked list"
3. Feed instruction back to LLM:
"Complete this instruction:"
→ LLM generates response
4. Filter bad/duplicate examples
5. Add good ones to the dataset
6. Repeat → 52K instructions from 175 seeds!
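The loop above can be sketched as follows. The three callables are placeholders (in the actual Self-Instruct pipeline the first two are LLM calls and the duplicate check uses a ROUGE-L similarity threshold); this is an illustrative sketch, not the paper's implementation.

```python
import itertools
import random

def self_instruct(seed_instructions, generate_instruction, generate_response,
                  is_duplicate, target_size=100):
    """Grow an instruction dataset from human-written seeds."""
    pool = list(seed_instructions)                   # step 1: seed instructions
    dataset = []
    while len(dataset) < target_size:
        examples = random.sample(pool, min(3, len(pool)))
        new_instr = generate_instruction(examples)   # step 2: propose a new instruction
        if is_duplicate(new_instr, pool):            # step 4: drop near-duplicates
            continue
        response = generate_response(new_instr)      # step 3: complete the instruction
        dataset.append({"instruction": new_instr, "output": response})
        pool.append(new_instr)                       # step 5: grow the pool, repeat
    return dataset

# Demo with stub callables (the real versions are LLM calls).
_ctr = itertools.count()
demo = self_instruct(
    ["Write a haiku", "Sum a list", "Define recursion"],
    generate_instruction=lambda ex: f"synthetic task {next(_ctr)}",
    generate_response=lambda instr: f"response to: {instr}",
    is_duplicate=lambda instr, pool: instr in pool,
    target_size=8,
)
print(len(demo), demo[0]["instruction"])
```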
EVOL-INSTRUCT (making instructions harder):
Simple: "Sort a list of numbers"
↓ evolve
Medium: "Sort a list of numbers using merge sort, explain complexity"
↓ evolve
Complex: "Implement an in-place merge sort that handles duplicates,
runs in O(n log n), and explain the space-time trade-offs
compared to quicksort"
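One way to drive that evolution is a small prompt builder. The operation list below paraphrases WizardLM's in-depth evolution operations (add constraints, require complexity analysis, concretize, demand multi-step reasoning); the exact wording here is an illustrative assumption, not the paper's prompts.

```python
import random

# Paraphrased in-depth evolution operations (illustrative wording).
EVOLVE_OPS = [
    "Add one more constraint or requirement.",
    "Require an explanation of the time and space complexity.",
    "Replace a general concept with a more specific one.",
    "Require multiple reasoning steps to solve.",
]

def build_evolve_prompt(instruction: str, op: str) -> str:
    """Wrap an instruction in a rewrite request that makes it harder."""
    return (
        "Rewrite the instruction below into a more complex version. "
        f"{op} Keep it answerable.\n\n"
        f"#Instruction#: {instruction}"
    )

random.seed(0)
print(build_evolve_prompt("Sort a list of numbers", random.choice(EVOLVE_OPS)))
```

Feeding the evolved instruction back through the builder (and an LLM) repeatedly produces the simple → medium → complex progression shown above.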
Data Quality Pipeline¶
RAW DATA (internet, books, code)
│
▼
┌─────────────────────────────────────────┐
│ 1. DEDUPLICATION │
│ Remove near-duplicates (MinHash) │
│ → Removes 30-50% of web data │
├─────────────────────────────────────────┤
│ 2. QUALITY FILTERING │
│   - Perplexity filter (remove gibberish)│
│ - Language detection │
│ - Document length filters │
│ - Classifier-based quality scoring │
│ → Another 30-40% removed │
├─────────────────────────────────────────┤
│ 3. CONTENT FILTERING │
│ - PII removal │
│ - Toxic content removal │
│ - Copyright/licensed content │
│ - Benchmark contamination check │
├─────────────────────────────────────────┤
│ 4. DATA MIXING │
│ - Balance domains (code, text, math)│
│ - Upsample high-quality sources │
│ - Control language distribution │
├─────────────────────────────────────────┤
│ 5. TOKENIZATION │
│ - Apply BPE/SentencePiece │
│ - Pack into training sequences │
│ - Shuffle │
└─────────────────────────────┬───────────┘
│
▼
CLEAN TRAINING DATA
(typically 1-15T tokens)
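Step 1 (MinHash deduplication) can be illustrated with a toy, stdlib-only sketch: hash word shingles under many seeded hash functions, keep the minimum per seed, and estimate Jaccard similarity as the fraction of matching signature slots. Production pipelines use a tuned library (e.g. datasketch) plus LSH banding rather than this pairwise comparison.

```python
import hashlib

def shingles(text: str, n: int = 3) -> set[str]:
    """Word n-gram shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_sig(doc_shingles: set[str], num_perm: int = 64) -> list[int]:
    """MinHash signature: the minimum hash value per seeded hash function."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in doc_shingles)
        for seed in range(num_perm)
    ]

def est_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = "the quick brown fox jumps over the lazy dog today"
b = "the quick brown fox jumps over the lazy cat today"   # near-duplicate
c = "completely unrelated sentence about data pipelines here"
sa, sb, sc = (minhash_sig(shingles(t)) for t in (a, b, c))
print(est_jaccard(sa, sb), est_jaccard(sa, sc))
```

Near-duplicates score high, unrelated documents near zero, so a threshold (often around 0.8) decides what to drop.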
Training Data Formats¶
═══ ALPACA FORMAT (simple) ═══
{
"instruction": "Explain the difference between TCP and UDP",
"input": "",
"output": "TCP is a connection-oriented protocol..."
}
═══ SHAREGPT FORMAT (conversational, most popular) ═══
{
"conversations": [
{"from": "human", "value": "What is machine learning?"},
{"from": "gpt", "value": "Machine learning is..."},
{"from": "human", "value": "Can you give an example?"},
{"from": "gpt", "value": "Sure! Consider spam detection..."}
]
}
═══ CHATML FORMAT (OpenAI-style) ═══
<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
What is machine learning?
<|im_end|>
<|im_start|>assistant
Machine learning is...
<|im_end|>
═══ FUNCTION CALLING FORMAT ═══
{
"messages": [
{"role": "user", "content": "Weather in Tokyo?"},
{"role": "assistant", "tool_calls": [
{"function": {"name": "get_weather", "arguments": "{\"city\":\"Tokyo\"}"}}
]},
{"role": "tool", "content": "22°C, partly cloudy"},
{"role": "assistant", "content": "It's 22°C and partly cloudy in Tokyo."}
]
}
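Converting between these formats is routine data-pipeline work. Here is a minimal sketch that renders one Alpaca example as a ChatML training string, assuming the `<|im_start|>`/`<|im_end|>` tokens shown above; real pipelines usually go through a tokenizer's chat template instead.

```python
def alpaca_to_chatml(ex: dict, system: str = "You are a helpful assistant.") -> str:
    """Render one Alpaca-format example as a ChatML training string."""
    user = ex["instruction"]
    if ex.get("input"):                     # optional context follows the instruction
        user += "\n\n" + ex["input"]
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n{ex['output']}<|im_end|>\n"
    )

print(alpaca_to_chatml({
    "instruction": "Explain the difference between TCP and UDP",
    "input": "",
    "output": "TCP is a connection-oriented protocol...",
}))
```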
The Model Collapse Problem¶
MODEL COLLAPSE:
When AI-generated data is used to train more AI,
which generates data for more AI... quality degrades.
Real data → Model A → Synthetic data → Model B →
Synthetic data → Model C → ... → garbage
Like photocopying a photocopy of a photocopy.
PREVENTION:
1. Always mix synthetic + real data (never 100% synthetic)
2. Use strong quality filtering
3. Maintain diversity (don't let one pattern dominate)
4. Use the BEST teacher model (not a distilled version)
5. Validate against held-out human-written benchmarks
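Prevention rule 1 can be enforced mechanically. Here is a sketch that caps the synthetic share of a training mix; the 30% real-data floor is an illustrative choice, not a published recipe.

```python
import random

def mix_datasets(real, synthetic, real_fraction=0.3, seed=0):
    """Cap the synthetic share so human-written data keeps at least
    `real_fraction` of the final mix (ratio is an illustrative choice)."""
    rng = random.Random(seed)
    n_syn = min(len(synthetic),
                round(len(real) * (1 - real_fraction) / real_fraction))
    mixed = real + rng.sample(synthetic, n_syn)
    rng.shuffle(mixed)
    return mixed

real = [{"source": "human", "id": i} for i in range(30)]
synthetic = [{"source": "model", "id": i} for i in range(200)]
mix = mix_datasets(real, synthetic, real_fraction=0.3)
print(len(mix), sum(ex["source"] == "human" for ex in mix))
```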
◆ Quick Reference¶
DATA SOURCES FOR FINE-TUNING:
Open datasets: HuggingFace Hub, OpenHermes, SlimOrca
Self-generated: Self-Instruct with GPT-4/Claude
Proprietary: Company emails, docs, support tickets
Augmented: Rephrase existing data, add edge cases
DATASET SIZE GUIDELINES:
Quick LoRA fine-tune: 1K-10K examples
Solid domain adapter: 10K-100K examples
Pre-training: 1T-15T tokens (massive!)
FORMAT CHOICE:
Simple tasks (Q&A) → Alpaca
Conversations (chat) → ShareGPT
Tool-using models → Function calling format
Reasoning models → Include chain-of-thought
QUALITY > QUANTITY:
1,000 high-quality examples > 100,000 noisy ones
Always filter, always validate, always verify
○ Gotchas & Common Mistakes¶
- ⚠️ ToS violations: Using GPT-4/Claude outputs to train competitor models may violate terms. Use open models for open training.
- ⚠️ Benchmark contamination: If training data leaks test data, benchmarks become meaningless. Always check for leakage.
- ⚠️ Format mismatches: If you train on Alpaca format but deploy with ChatML format, performance drops. Match formats.
- ⚠️ Overrepresentation bias: If 80% of your data is "helpful assistant" style, the model becomes a generic assistant regardless of fine-tuning goal.
○ Interview Angles¶
- Q: How would you train a domain-specific LLM?
- A: (1) Collect domain documents, (2) Generate synthetic instruction-response pairs using a teacher model, (3) Quality-filter using domain experts + LLM-as-judge, (4) Format in ShareGPT/ChatML, (5) Fine-tune with LoRA/QLoRA, (6) Evaluate against domain-specific benchmarks.
- Q: What's the risk of training on synthetic data?
- A: Model collapse — progressive quality degradation across generations. Also bias amplification (synthetic data inherits teacher's biases) and benchmark contamination. Mitigate by mixing with real data, strong quality filtering, and using diverse teacher models.
★ Code & Implementation¶
Synthetic Instruction Data Generator¶
# pip install openai>=1.60
# ⚠️ Last tested: 2026-04 | Requires: openai>=1.60, OPENAI_API_KEY env var
from openai import OpenAI
import json
client = OpenAI()
def generate_instruction_dataset(
domain: str,
num_examples: int = 10,
task_types: list[str] | None = None,
) -> list[dict]:
"""Generate synthetic instruction-response pairs for SFT fine-tuning."""
if task_types is None:
task_types = ["explain", "summarize", "compare", "give example", "list steps for"]
prompt = (
f"Generate {num_examples} diverse instruction-response pairs for the domain: '{domain}'.\n"
f"Use these task types: {', '.join(task_types)}.\n"
"Each pair should be high-quality and diverse.\n\n"
"JSON format only:\n"
'[{"instruction": "...", "response": "...", "task_type": "..."}, ...]'
)
resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
temperature=0.9, # high diversity
max_tokens=3000,
response_format={"type": "json_object"},
)
# Parse response (model returns single JSON object with list inside)
raw = json.loads(resp.choices[0].message.content)
# Handle different keys the model might use
for key in ("items", "examples", "pairs", "data"):
if key in raw:
return raw[key]
return list(raw.values())[0] if raw else []
# Generate RAG training data
examples = generate_instruction_dataset(
domain="Retrieval-Augmented Generation for enterprise software",
num_examples=5,
)
for ex in examples[:3]:
print(f"[{ex.get('task_type', 'N/A')}] {ex['instruction'][:60]}...")
print(f" → {ex['response'][:80]}...\n")
# Save for fine-tuning
with open("synthetic_sft_data.jsonl", "w") as f:
for ex in examples:
f.write(json.dumps(ex) + "\n")
print(f"Saved {len(examples)} examples to synthetic_sft_data.jsonl")
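Generated examples should be filtered before saving, as the quality pipeline section stresses. A minimal stand-in filter on the `instruction`/`response` fields the generator produces (real pipelines add perplexity, toxicity, and LLM-as-judge checks):

```python
def filter_examples(examples, min_len=20, max_len=2000):
    """Drop out-of-range responses and duplicate instructions (after
    whitespace/case normalization). A minimal stand-in for the multi-stage
    quality pipeline described earlier."""
    seen, kept = set(), []
    for ex in examples:
        instr = " ".join(ex["instruction"].lower().split())
        if instr in seen or not (min_len <= len(ex["response"]) <= max_len):
            continue
        seen.add(instr)
        kept.append(ex)
    return kept

demo = filter_examples([
    {"instruction": "Explain RAG", "response": "Retrieval-Augmented Generation is..."},
    {"instruction": "explain  rag", "response": "Duplicate instruction, different text."},
    {"instruction": "Define chunking", "response": "too short"},
])
print(f"kept {len(demo)} of 3")
```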
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | Fine Tuning, Tokenization |
| Leads to | Distillation And Compression, Better models |
| Compare with | Traditional ML data pipelines, Human annotation |
| Cross-domain | Data engineering, ETL pipelines, Data quality |
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Model collapse | Synthetic-data-trained model produces repetitive outputs | Training on data from same model family | Mix synthetic with real data (20%+ real), diverse generators |
| Distribution mismatch | Model trained on synthetic data fails on real inputs | Synthetic data doesn't match production distribution | Validate against real data statistics, domain-specific generators |
| Quality amplification | Errors in seed data get amplified through pipeline | No quality filtering on generated data | Multi-stage quality filtering, LLM-as-judge scoring |
| PII leakage | Generated data contains memorized PII from training | Large models memorize training examples | Differential privacy, PII detection on output, canary tokens |
◆ Hands-On Exercises¶
Exercise 1: Generate and Validate a Synthetic Dataset¶
Goal: Create synthetic training data and measure quality vs real data
Time: 45 minutes
Steps:
1. Take 50 real examples from a classification dataset
2. Use an LLM to generate 200 synthetic examples matching the distribution
3. Train a classifier on synthetic-only vs real-only vs mixed
4. Compare test accuracy across all three
Expected Output: Accuracy comparison table showing mixed data performs best
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 📄 Paper | Peng et al. "Instruction Tuning with GPT-4" (2023) | Foundational approach to synthetic instruction data |
| 📘 Book | "AI Engineering" by Chip Huyen (2025), Ch 4 | Covers synthetic data for evaluation and training |
| 🔧 Hands-on | Argilla Documentation | Platform for data labeling and synthetic data curation |
★ Sources¶
- Wang et al., "Self-Instruct" (2023)
- Xu et al., "WizardLM: Empowering Large Language Models to Follow Complex Instructions" (2023)
- Microsoft, "Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone" (2024)
- Shumailov et al., "The Curse of Recursion: Training on Generated Data Makes Models Forget" (2023)
- HuggingFace Datasets — https://huggingface.co/datasets