Synthetic Data & Data Engineering for LLMs¶
✨ Bit: We've nearly read the entire internet. Estimates suggest high-quality public text data ran out around 2024. So what do we do? We use AI to generate MORE training data for AI. It sounds circular, but it works, if you're careful.
★ TL;DR¶
- What: Generating artificial training data using LLMs and curating/filtering real data for training
- Why: Training data is the MOAT. The quality, format, and diversity of data determine model quality more than architecture or compute.
- Key point: Models like Phi-3, Orca-2, and OpenHermes lean heavily on synthetic data generated by stronger teacher models such as GPT-4, and they're shockingly good. Data engineering IS model engineering.
★ Overview¶
Definition¶
- Synthetic data: AI-generated training examples (instruction-response pairs, reasoning chains, code) used to train or fine-tune other models
- Data engineering for LLMs: The full pipeline of collecting, filtering, deduplicating, formatting, and curating training data
Scope¶
Covers data generation methods, quality filtering, and format standards. For fine-tuning methods (LoRA, QLoRA), see Fine Tuning. For distillation as a training method, see Distillation And Compression.
Significance¶
- "Data is the new model architecture" — Microsoft Phi team
- Data generation can cost $0 with open models (no API fees)
- Understanding data formats is essential for fine-tuning
- Interview: "How would you create training data for a domain-specific LLM?"
★ Deep Dive¶
The Data-Centric AI Shift¶
MODEL-CENTRIC (old):
Fixed data + better architecture/training = better model
DATA-CENTRIC (now):
Fixed architecture + better data = better model
EVIDENCE:
Phi-3 (3.8B) ≈ GPT-3.5 quality ← HOW?
→ Trained on ~3.3T tokens of heavily filtered + synthetic data
→ Data quality > model size
Synthetic Data Generation Methods¶
| Method | How It Works | Example |
|---|---|---|
| Self-Instruct | Model generates instructions + answers from seeds | Stanford Alpaca |
| Evol-Instruct | Iteratively make instructions more complex | WizardLM |
| Distillation | Larger model generates high-quality outputs | Phi-3, Orca-2 |
| Persona-based | Assign personas for diverse responses | Persona Hub |
| Back-translation | Generate code → describe it, or vice versa | Code training |
| Rejection sampling | Generate many, keep only the best | DeepSeek-Math |
| Constitutional AI | Model critiques and revises its own outputs | Anthropic's Claude |
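The rejection-sampling row above can be sketched in a few lines. The `generate` and `score` callables are placeholders, not a real API: in practice `generate` is a high-temperature LLM sampling call and `score` is a verifier, unit test, or reward model.

```python
import random

def rejection_sample(prompt, generate, score, n=8, threshold=0.8):
    """Draw n candidate responses and keep only those the scorer accepts."""
    candidates = [generate(prompt) for _ in range(n)]
    return [c for c in candidates if score(prompt, c) >= threshold]

# Toy demo with stand-in generator/scorer; a real pipeline would sample an
# LLM at high temperature and score with a verifier or reward model.
random.seed(0)
gen = lambda p: f"{p} candidate {random.random():.2f}"
scorer = lambda p, c: float(c.split()[-1])   # "score" is the trailing number
kept = rejection_sample("2+2?", gen, scorer, n=10, threshold=0.5)
print(f"kept {len(kept)} of 10 candidates")
```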
SELF-INSTRUCT PIPELINE:
1. Start with 175 seed instructions (human-written)
2. Feed to LLM:
"Generate a new instruction similar to these examples:"
→ LLM generates: "Write a function to reverse a linked list"
3. Feed instruction back to LLM:
"Complete this instruction:"
→ LLM generates response
4. Filter bad/duplicate examples
5. Add good ones to the dataset
6. Repeat → 52K instructions from 175 seeds!
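The loop above can be sketched as follows. The three callables are placeholders (in the actual Self-Instruct pipeline the first two are LLM calls and the duplicate check uses a ROUGE-L similarity threshold); this is an illustrative sketch, not the paper's implementation.

```python
import itertools
import random

def self_instruct(seed_instructions, generate_instruction, generate_response,
                  is_duplicate, target_size=100):
    """Grow an instruction dataset from human-written seeds."""
    pool = list(seed_instructions)                   # step 1: seed instructions
    dataset = []
    while len(dataset) < target_size:
        examples = random.sample(pool, min(3, len(pool)))
        new_instr = generate_instruction(examples)   # step 2: propose a new instruction
        if is_duplicate(new_instr, pool):            # step 4: drop near-duplicates
            continue
        response = generate_response(new_instr)      # step 3: complete the instruction
        dataset.append({"instruction": new_instr, "output": response})
        pool.append(new_instr)                       # step 5: grow the pool, repeat
    return dataset

# Demo with stub callables (the real versions are LLM calls).
_ctr = itertools.count()
demo = self_instruct(
    ["Write a haiku", "Sum a list", "Define recursion"],
    generate_instruction=lambda ex: f"synthetic task {next(_ctr)}",
    generate_response=lambda instr: f"response to: {instr}",
    is_duplicate=lambda instr, pool: instr in pool,
    target_size=8,
)
print(len(demo), demo[0]["instruction"])
```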
EVOL-INSTRUCT (making instructions harder):
Simple: "Sort a list of numbers"
↓ evolve
Medium: "Sort a list of numbers using merge sort, explain complexity"
↓ evolve
Complex: "Implement an in-place merge sort that handles duplicates,
runs in O(n log n), and explain the space-time trade-offs
compared to quicksort"
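One way to drive that evolution is a small prompt builder. The operation list below paraphrases WizardLM's in-depth evolution operations (add constraints, require complexity analysis, concretize, demand multi-step reasoning); the exact wording here is an illustrative assumption, not the paper's prompts.

```python
import random

# Paraphrased in-depth evolution operations (illustrative wording).
EVOLVE_OPS = [
    "Add one more constraint or requirement.",
    "Require an explanation of the time and space complexity.",
    "Replace a general concept with a more specific one.",
    "Require multiple reasoning steps to solve.",
]

def build_evolve_prompt(instruction: str, op: str) -> str:
    """Wrap an instruction in a rewrite request that makes it harder."""
    return (
        "Rewrite the instruction below into a more complex version. "
        f"{op} Keep it answerable.\n\n"
        f"#Instruction#: {instruction}"
    )

random.seed(0)
print(build_evolve_prompt("Sort a list of numbers", random.choice(EVOLVE_OPS)))
```

Feeding the evolved instruction back through the builder (and an LLM) repeatedly produces the simple → medium → complex progression shown above.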
Data Quality Pipeline¶
RAW DATA (internet, books, code)
│
▼
┌─────────────────────────────────────────┐
│ 1. DEDUPLICATION │
│ Remove near-duplicates (MinHash) │
│ → Removes 30-50% of web data │
├─────────────────────────────────────────┤
│ 2. QUALITY FILTERING │
│   - Perplexity filter (remove gibberish)│
│ - Language detection │
│ - Document length filters │
│ - Classifier-based quality scoring │
│ → Another 30-40% removed │
├─────────────────────────────────────────┤
│ 3. CONTENT FILTERING │
│ - PII removal │
│ - Toxic content removal │
│ - Copyright/licensed content │
│ - Benchmark contamination check │
├─────────────────────────────────────────┤
│ 4. DATA MIXING │
│ - Balance domains (code, text, math)│
│ - Upsample high-quality sources │
│ - Control language distribution │
├─────────────────────────────────────────┤
│ 5. TOKENIZATION │
│ - Apply BPE/SentencePiece │
│ - Pack into training sequences │
│ - Shuffle │
└─────────────────────────────┬───────────┘
│
▼
CLEAN TRAINING DATA
(typically 1-15T tokens)
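Step 1 (MinHash deduplication) can be illustrated with a toy, stdlib-only sketch: hash word shingles under many seeded hash functions, keep the minimum per seed, and estimate Jaccard similarity as the fraction of matching signature slots. Production pipelines use a tuned library (e.g. datasketch) plus LSH banding rather than this pairwise comparison.

```python
import hashlib

def shingles(text: str, n: int = 3) -> set[str]:
    """Word n-gram shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_sig(doc_shingles: set[str], num_perm: int = 64) -> list[int]:
    """MinHash signature: the minimum hash value per seeded hash function."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in doc_shingles)
        for seed in range(num_perm)
    ]

def est_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = "the quick brown fox jumps over the lazy dog today"
b = "the quick brown fox jumps over the lazy cat today"   # near-duplicate
c = "completely unrelated sentence about data pipelines here"
sa, sb, sc = (minhash_sig(shingles(t)) for t in (a, b, c))
print(est_jaccard(sa, sb), est_jaccard(sa, sc))
```

Near-duplicates score high, unrelated documents near zero, so a threshold (often around 0.8) decides what to drop.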
Training Data Formats¶
═══ ALPACA FORMAT (simple) ═══
{
"instruction": "Explain the difference between TCP and UDP",
"input": "",
"output": "TCP is a connection-oriented protocol..."
}
═══ SHAREGPT FORMAT (conversational, most popular) ═══
{
"conversations": [
{"from": "human", "value": "What is machine learning?"},
{"from": "gpt", "value": "Machine learning is..."},
{"from": "human", "value": "Can you give an example?"},
{"from": "gpt", "value": "Sure! Consider spam detection..."}
]
}
═══ CHATML FORMAT (OpenAI-style) ═══
<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
What is machine learning?
<|im_end|>
<|im_start|>assistant
Machine learning is...
<|im_end|>
═══ FUNCTION CALLING FORMAT ═══
{
"messages": [
{"role": "user", "content": "Weather in Tokyo?"},
{"role": "assistant", "tool_calls": [
{"function": {"name": "get_weather", "arguments": "{\"city\":\"Tokyo\"}"}}
]},
{"role": "tool", "content": "22°C, partly cloudy"},
{"role": "assistant", "content": "It's 22°C and partly cloudy in Tokyo."}
]
}
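Converting between these formats is routine data-pipeline work. Here is a minimal sketch that renders one Alpaca example as a ChatML training string, assuming the `<|im_start|>`/`<|im_end|>` tokens shown above; real pipelines usually go through a tokenizer's chat template instead.

```python
def alpaca_to_chatml(ex: dict, system: str = "You are a helpful assistant.") -> str:
    """Render one Alpaca-format example as a ChatML training string."""
    user = ex["instruction"]
    if ex.get("input"):                     # optional context follows the instruction
        user += "\n\n" + ex["input"]
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n{ex['output']}<|im_end|>\n"
    )

print(alpaca_to_chatml({
    "instruction": "Explain the difference between TCP and UDP",
    "input": "",
    "output": "TCP is a connection-oriented protocol...",
}))
```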
The Model Collapse Problem¶
MODEL COLLAPSE:
When AI-generated data is used to train more AI,
which generates data for more AI... quality degrades.
Real data → Model A → Synthetic data → Model B →
Synthetic data → Model C → ... → garbage
Like photocopying a photocopy of a photocopy.
PREVENTION:
1. Always mix synthetic + real data (never 100% synthetic)
2. Use strong quality filtering
3. Maintain diversity (don't let one pattern dominate)
4. Use the BEST teacher model (not a distilled version)
5. Validate against held-out human-written benchmarks
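Prevention rule 1 can be enforced mechanically. Here is a sketch that caps the synthetic share of a training mix; the 30% real-data floor is an illustrative choice, not a published recipe.

```python
import random

def mix_datasets(real, synthetic, real_fraction=0.3, seed=0):
    """Cap the synthetic share so human-written data keeps at least
    `real_fraction` of the final mix (ratio is an illustrative choice)."""
    rng = random.Random(seed)
    n_syn = min(len(synthetic),
                round(len(real) * (1 - real_fraction) / real_fraction))
    mixed = real + rng.sample(synthetic, n_syn)
    rng.shuffle(mixed)
    return mixed

real = [{"source": "human", "id": i} for i in range(30)]
synthetic = [{"source": "model", "id": i} for i in range(200)]
mix = mix_datasets(real, synthetic, real_fraction=0.3)
print(len(mix), sum(ex["source"] == "human" for ex in mix))
```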
◆ Quick Reference¶
DATA SOURCES FOR FINE-TUNING:
Open datasets: HuggingFace Hub, OpenHermes, SlimOrca
Self-generated: Self-Instruct with GPT-4/Claude
Proprietary: Company emails, docs, support tickets
Augmented: Rephrase existing data, add edge cases
DATASET SIZE GUIDELINES:
Quick LoRA fine-tune: 1K-10K examples
Solid domain adapter: 10K-100K examples
Pre-training: 1T-15T tokens (massive!)
FORMAT CHOICE:
Simple tasks (Q&A) → Alpaca
Conversations (chat) → ShareGPT
Tool-using models → Function calling format
Reasoning models → Include chain-of-thought
QUALITY > QUANTITY:
1,000 high-quality examples > 100,000 noisy ones
Always filter, always validate, always verify
○ Gotchas & Common Mistakes¶
- ⚠️ ToS violations: Using GPT-4/Claude outputs to train competitor models may violate terms. Use open models for open training.
- ⚠️ Benchmark contamination: If training data leaks test data, benchmarks become meaningless. Always check for leakage.
- ⚠️ Format mismatches: If you train on Alpaca format but deploy with ChatML format, performance drops. Match formats.
- ⚠️ Overrepresentation bias: If 80% of your data is "helpful assistant" style, the model becomes a generic assistant regardless of fine-tuning goal.
○ Interview Angles¶
- Q: How would you train a domain-specific LLM?
- A: (1) Collect domain documents, (2) Generate synthetic instruction-response pairs using a teacher model, (3) Quality-filter using domain experts + LLM-as-judge, (4) Format in ShareGPT/ChatML, (5) Fine-tune with LoRA/QLoRA, (6) Evaluate against domain-specific benchmarks.
- Q: What's the risk of training on synthetic data?
- A: Model collapse — progressive quality degradation across generations. Also bias amplification (synthetic data inherits teacher's biases) and benchmark contamination. Mitigate by mixing with real data, strong quality filtering, and using diverse teacher models.
★ Code & Implementation¶
Synthetic Instruction Data Generator¶
# pip install openai>=1.60
# ⚠️ Last tested: 2026-04 | Requires: openai>=1.60, OPENAI_API_KEY env var
from openai import OpenAI
import json
client = OpenAI()
def generate_instruction_dataset(
domain: str,
num_examples: int = 10,
task_types: list[str] | None = None,
) -> list[dict]:
"""Generate synthetic instruction-response pairs for SFT fine-tuning."""
if task_types is None:
task_types = ["explain", "summarize", "compare", "give example", "list steps for"]
prompt = (
f"Generate {num_examples} diverse instruction-response pairs for the domain: '{domain}'.\n"
f"Use these task types: {', '.join(task_types)}.\n"
"Each pair should be high-quality and diverse.\n\n"
"JSON format only:\n"
'[{"instruction": "...", "response": "...", "task_type": "..."}, ...]'
)
resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
temperature=0.9, # high diversity
max_tokens=3000,
response_format={"type": "json_object"},
)
# Parse response (model returns single JSON object with list inside)
raw = json.loads(resp.choices[0].message.content)
# Handle different keys the model might use
for key in ("items", "examples", "pairs", "data"):
if key in raw:
return raw[key]
return list(raw.values())[0] if raw else []
# Generate RAG training data
examples = generate_instruction_dataset(
domain="Retrieval-Augmented Generation for enterprise software",
num_examples=5,
)
for ex in examples[:3]:
print(f"[{ex.get('task_type', 'N/A')}] {ex['instruction'][:60]}...")
print(f" → {ex['response'][:80]}...\n")
# Save for fine-tuning
with open("synthetic_sft_data.jsonl", "w") as f:
for ex in examples:
f.write(json.dumps(ex) + "\n")
print(f"Saved {len(examples)} examples to synthetic_sft_data.jsonl")
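Generated examples should be filtered before saving, as the quality pipeline section stresses. A minimal stand-in filter on the `instruction`/`response` fields the generator produces (real pipelines add perplexity, toxicity, and LLM-as-judge checks):

```python
def filter_examples(examples, min_len=20, max_len=2000):
    """Drop out-of-range responses and duplicate instructions (after
    whitespace/case normalization). A minimal stand-in for the multi-stage
    quality pipeline described earlier."""
    seen, kept = set(), []
    for ex in examples:
        instr = " ".join(ex["instruction"].lower().split())
        if instr in seen or not (min_len <= len(ex["response"]) <= max_len):
            continue
        seen.add(instr)
        kept.append(ex)
    return kept

demo = filter_examples([
    {"instruction": "Explain RAG", "response": "Retrieval-Augmented Generation is..."},
    {"instruction": "explain  rag", "response": "Duplicate instruction, different text."},
    {"instruction": "Define chunking", "response": "too short"},
])
print(f"kept {len(demo)} of 3")
```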
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | Fine Tuning, Tokenization |
| Leads to | Distillation And Compression, Better models |
| Compare with | Traditional ML data pipelines, Human annotation |
| Cross-domain | Data engineering, ETL pipelines, Data quality |
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Model collapse | Synthetic-data-trained model produces repetitive outputs | Training on data from same model family | Mix synthetic with real data (20%+ real), diverse generators |
| Distribution mismatch | Model trained on synthetic data fails on real inputs | Synthetic data doesn't match production distribution | Validate against real data statistics, domain-specific generators |
| Quality amplification | Errors in seed data get amplified through pipeline | No quality filtering on generated data | Multi-stage quality filtering, LLM-as-judge scoring |
| PII leakage | Generated data contains memorized PII from training | Large models memorize training examples | Differential privacy, PII detection on output, canary tokens |
◆ Hands-On Exercises¶
Exercise 1: Generate and Validate a Synthetic Dataset¶
Goal: Create synthetic training data and measure quality vs real data
Time: 45 minutes
Steps:
1. Take 50 real examples from a classification dataset
2. Use an LLM to generate 200 synthetic examples matching the distribution
3. Train a classifier on synthetic-only vs real-only vs mixed
4. Compare test accuracy across all three
Expected Output: Accuracy comparison table showing mixed data performs best
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 📄 Paper | Peng et al. "Instruction Tuning with GPT-4" (2023) | Foundational approach to synthetic instruction data |
| 📘 Book | "AI Engineering" by Chip Huyen (2025), Ch 4 | Covers synthetic data for evaluation and training |
| 🔧 Hands-on | Argilla Documentation | Platform for data labeling and synthetic data curation |
★ Sources¶
- Wang et al., "Self-Instruct" (2023)
- Xu et al., "WizardLM: Empowering Large Language Models to Follow Complex Instructions" (2023)
- Microsoft, "Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone" (2024)
- Shumailov et al., "The Curse of Recursion: Training on Generated Data Makes Models Forget" (2023)
- HuggingFace Datasets — https://huggingface.co/datasets