
Synthetic Data & Data Engineering for LLMs

Bit: We've read the entire internet. Literally — we ran out of public text data around 2024. So what do we do? We use AI to generate MORE training data for AI. Sounds circular, but it works — if you're careful.


★ TL;DR

  • What: Generating artificial training data using LLMs and curating/filtering real data for training
  • Why: Training data is the MOAT. The quality, format, and diversity of data determine model quality more than architecture or compute.
  • Key point: Models like Phi-3, Orca-2, and OpenHermes lean heavily on synthetic data generated by frontier models like GPT-4 — and they're shockingly good. Data engineering IS model engineering.

★ Overview

Definition

  • Synthetic data: AI-generated training examples (instruction-response pairs, reasoning chains, code) used to train or fine-tune other models
  • Data engineering for LLMs: The full pipeline of collecting, filtering, deduplicating, formatting, and curating training data

Scope

Covers data generation methods, quality filtering, and format standards. For fine-tuning methods (LoRA, QLoRA), see Fine Tuning. For distillation as a training method, see Distillation And Compression.

Significance

  • "Data is the new model architecture" — Microsoft Phi team
  • Data generation can cost $0 when using open-weight models run locally (no API fees)
  • Understanding data formats is essential for fine-tuning
  • Interview: "How would you create training data for a domain-specific LLM?"

★ Deep Dive

The Data-Centric AI Shift

MODEL-CENTRIC (old):
  Fixed data + better architecture/training = better model

DATA-CENTRIC (now):
  Fixed architecture + better data = better model

EVIDENCE:
  Phi-3 (3.8B) ≈ GPT-3.5 quality  ← HOW?
  → Trained on ~3.3T tokens of heavily filtered + synthetic data
  → Data quality > model size

Synthetic Data Generation Methods

  Method               How It Works                                        Example
  ───────────────────  ──────────────────────────────────────────────────  ──────────────────
  Self-Instruct        Model generates instructions + answers from seeds   Stanford Alpaca
  Evol-Instruct        Iteratively make instructions more complex          WizardLM
  Distillation         Larger model generates high-quality outputs         Phi-3, Orca-2
  Persona-based        Assign personas for diverse responses               Persona Hub
  Back-translation     Generate code → describe it, or vice versa          Code training
  Rejection sampling   Generate many, keep only the best                   DeepSeek-Math
  Constitutional AI    Model critiques and revises its own outputs         Anthropic's Claude

SELF-INSTRUCT PIPELINE:

  1. Start with 175 seed instructions (human-written)

  2. Feed to LLM:
     "Generate a new instruction similar to these examples:"
     → LLM generates: "Write a function to reverse a linked list"

  3. Feed instruction back to LLM:
     "Complete this instruction:"
     → LLM generates response

  4. Filter bad/duplicate examples

  5. Add good ones to the dataset

  6. Repeat → 52K instructions from 175 seeds!
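The loop above can be sketched in a few lines of Python. `llm` is a hypothetical callable standing in for any text-completion API; the real Self-Instruct filter uses ROUGE-L overlap, approximated here with the stdlib's difflib:

```python
from difflib import SequenceMatcher

def too_similar(candidate: str, pool: list[str], threshold: float = 0.7) -> bool:
    """Reject candidates that closely overlap anything already in the pool.
    (Self-Instruct uses ROUGE-L overlap; difflib is a stdlib stand-in.)"""
    return any(SequenceMatcher(None, candidate, seen).ratio() > threshold
               for seen in pool)

def self_instruct(llm, seeds: list[str], target: int) -> list[dict]:
    """Grow an instruction-response dataset from human-written seeds."""
    pool = list(seeds)                      # step 1: seed instructions
    dataset = []
    while len(dataset) < target:
        # step 2: prompt for a new instruction conditioned on recent examples
        instruction = llm(f"Generate a new instruction similar to these examples: {pool[-5:]}")
        if too_similar(instruction, pool):  # step 4: filter near-duplicates
            continue
        # step 3: have the model answer its own instruction
        response = llm(f"Complete this instruction: {instruction}")
        pool.append(instruction)            # step 5: grow the pool
        dataset.append({"instruction": instruction, "response": response})
    return dataset
```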


EVOL-INSTRUCT (making instructions harder):

  Simple: "Sort a list of numbers"
       ↓ evolve
  Medium: "Sort a list of numbers using merge sort, explain complexity"
       ↓ evolve
  Complex: "Implement an in-place merge sort that handles duplicates,
            runs in O(n log n), and explain the space-time trade-offs
            compared to quicksort"
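The evolution step is just a set of rewrite prompts applied repeatedly. The templates below loosely follow WizardLM's "in-depth evolving" operations — the wording is illustrative, not the paper's exact prompts:

```python
import random

# Evolution operations loosely following WizardLM's "in-depth evolving";
# the prompt wording here is illustrative, not the paper's exact templates.
EVOLVE_TEMPLATES = [
    "Add one more constraint or requirement to this instruction:\n{instruction}",
    "Rewrite this instruction so it requires multi-step reasoning:\n{instruction}",
    "Replace general concepts in this instruction with more specific ones:\n{instruction}",
    "Rewrite this instruction so it must handle a tricky edge case:\n{instruction}",
]

def evolve(llm, instruction: str, rounds: int = 3, seed: int = 0) -> list[str]:
    """Return the chain of progressively harder instructions, starting with the original."""
    rng = random.Random(seed)
    chain = [instruction]
    for _ in range(rounds):
        prompt = rng.choice(EVOLVE_TEMPLATES).format(instruction=chain[-1])
        chain.append(llm(prompt))
    return chain
```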

Data Quality Pipeline

RAW DATA (internet, books, code)
┌─────────────────────────────────────────┐
│  1. DEDUPLICATION                       │
│     Remove near-duplicates (MinHash)    │
│     → Removes 30-50% of web data        │
├─────────────────────────────────────────┤
│  2. QUALITY FILTERING                   │
│     - Perplexity filter (gibberish)     │
│     - Language detection                │
│     - Document length filters           │
│     - Classifier-based quality scoring  │
│     → Another 30-40% removed            │
├─────────────────────────────────────────┤
│  3. CONTENT FILTERING                   │
│     - PII removal                       │
│     - Toxic content removal             │
│     - Copyright/licensed content        │
│     - Benchmark contamination check     │
├─────────────────────────────────────────┤
│  4. DATA MIXING                         │
│     - Balance domains (code, text, math)│
│     - Upsample high-quality sources     │
│     - Control language distribution     │
├─────────────────────────────────────────┤
│  5. TOKENIZATION                        │
│     - Apply BPE/SentencePiece           │
│     - Pack into training sequences      │
│     - Shuffle                           │
└────────────────────┬────────────────────┘
                     ↓
            CLEAN TRAINING DATA
           (typically 1-15T tokens)
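Step 1 (deduplication) is the biggest single cut. Here is a minimal MinHash sketch using only the stdlib — production pipelines use libraries like datasketch plus LSH banding to avoid comparing every pair of documents:

```python
import hashlib

def minhash(text: str, num_hashes: int = 64, shingle_size: int = 3) -> list[int]:
    """MinHash signature over word shingles: for each of num_hashes salted
    hash functions, keep the minimum hash value over all shingles."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + shingle_size])
                for i in range(max(1, len(words) - shingle_size + 1))}
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles))
    return sig

def est_jaccard(a: list[int], b: list[int]) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

Two documents whose estimated Jaccard similarity exceeds some threshold (0.8 is a common choice) are treated as near-duplicates and one is dropped.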

Training Data Formats

═══ ALPACA FORMAT (simple) ═══
{
  "instruction": "Explain the difference between TCP and UDP",
  "input": "",
  "output": "TCP is a connection-oriented protocol..."
}

═══ SHAREGPT FORMAT (conversational, most popular) ═══
{
  "conversations": [
    {"from": "human", "value": "What is machine learning?"},
    {"from": "gpt", "value": "Machine learning is..."},
    {"from": "human", "value": "Can you give an example?"},
    {"from": "gpt", "value": "Sure! Consider spam detection..."}
  ]
}

═══ CHATML FORMAT (OpenAI-style) ═══
<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
What is machine learning?
<|im_end|>
<|im_start|>assistant
Machine learning is...
<|im_end|>

═══ FUNCTION CALLING FORMAT ═══
{
  "messages": [
    {"role": "user", "content": "Weather in Tokyo?"},
    {"role": "assistant", "tool_calls": [
      {"function": {"name": "get_weather", "arguments": "{\"city\":\"Tokyo\"}"}}
    ]},
    {"role": "tool", "content": "22°C, partly cloudy"},
    {"role": "assistant", "content": "It's 22°C and partly cloudy in Tokyo."}
  ]
}
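Converting between these formats is routine data engineering. A minimal ShareGPT-to-ChatML renderer (role names follow the ShareGPT convention above; `<|im_start|>`/`<|im_end|>` are ChatML's delimiters):

```python
# ShareGPT speaker tags mapped to ChatML roles
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def sharegpt_to_chatml(record: dict, system: str = "You are a helpful assistant.") -> str:
    """Render a ShareGPT-style conversation as a ChatML training string."""
    turns = [f"<|im_start|>system\n{system}\n<|im_end|>"]
    for msg in record["conversations"]:
        role = ROLE_MAP[msg["from"]]
        turns.append(f"<|im_start|>{role}\n{msg['value']}\n<|im_end|>")
    return "\n".join(turns)
```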

The Model Collapse Problem

MODEL COLLAPSE:
  When AI-generated data is used to train more AI,
  which generates data for more AI... quality degrades.

  Real data → Model A → Synthetic data → Model B →
  Synthetic data → Model C → ... → garbage

  Like photocopying a photocopy of a photocopy.

PREVENTION:
  1. Always mix synthetic + real data (never 100% synthetic)
  2. Use strong quality filtering
  3. Maintain diversity (don't let one pattern dominate)
  4. Use the BEST teacher model (not a distilled version)
  5. Validate against held-out human-written benchmarks
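Prevention rule 1 can be enforced mechanically at dataset-assembly time. A sketch that caps the synthetic share so real data never falls below a floor — the 20% default is an illustrative choice, not a published constant:

```python
import random

def mix_datasets(real: list, synthetic: list,
                 min_real_frac: float = 0.2, seed: int = 0) -> list:
    """Assemble a training mix with a guaranteed floor of real examples.
    Caps the synthetic share rather than duplicating real data."""
    rng = random.Random(seed)
    # largest synthetic count that keeps real data at >= min_real_frac
    max_synth = int(len(real) * (1 - min_real_frac) / min_real_frac)
    kept_synth = rng.sample(synthetic, min(len(synthetic), max_synth))
    mixed = real + kept_synth
    rng.shuffle(mixed)
    return mixed
```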

◆ Quick Reference

DATA SOURCES FOR FINE-TUNING:
  Open datasets:    HuggingFace Hub, OpenHermes, SlimOrca
  Self-generated:   Self-Instruct with GPT-4/Claude
  Proprietary:      Company emails, docs, support tickets
  Augmented:        Rephrase existing data, add edge cases

DATASET SIZE GUIDELINES:
  Quick LoRA fine-tune:     1K-10K examples
  Solid domain adapter:     10K-100K examples
  Pre-training:             1T-15T tokens (massive!)

FORMAT CHOICE:
  Simple tasks (Q&A)        → Alpaca
  Conversations (chat)      → ShareGPT
  Tool-using models         → Function calling format
  Reasoning models          → Include chain-of-thought

QUALITY > QUANTITY:
  1,000 high-quality examples > 100,000 noisy ones
  Always filter, always validate, always verify

○ Gotchas & Common Mistakes

  • ⚠️ ToS violations: Using GPT-4/Claude outputs to train competitor models may violate terms. Use open models for open training.
  • ⚠️ Benchmark contamination: If training data leaks test data, benchmarks become meaningless. Always check for leakage.
  • ⚠️ Format mismatches: If you train on Alpaca format but deploy with ChatML format, performance drops. Match formats.
  • ⚠️ Overrepresentation bias: If 80% of your data is "helpful assistant" style, the model becomes a generic assistant regardless of fine-tuning goal.
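The contamination check above can be as simple as word n-gram overlap between training documents and benchmark items. This is a sketch — production checks typically use normalized matching with n-grams in the 8-13 word range, and the choice of n here is an illustrative default:

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lowercased word n-grams for overlap checking."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_doc: str, benchmark_items: list[str], n: int = 8) -> bool:
    """Flag a training document that shares any n-gram with a benchmark item."""
    doc_grams = ngrams(train_doc, n)
    return any(doc_grams & ngrams(item, n) for item in benchmark_items)
```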

○ Interview Angles

  • Q: How would you train a domain-specific LLM?
  • A: (1) Collect domain documents, (2) Generate synthetic instruction-response pairs using a teacher model, (3) Quality-filter using domain experts + LLM-as-judge, (4) Format in ShareGPT/ChatML, (5) Fine-tune with LoRA/QLoRA, (6) Evaluate against domain-specific benchmarks.

  • Q: What's the risk of training on synthetic data?

  • A: Model collapse — progressive quality degradation across generations. Also bias amplification (synthetic data inherits teacher's biases) and benchmark contamination. Mitigate by mixing with real data, strong quality filtering, and using diverse teacher models.

★ Code & Implementation

Synthetic Instruction Data Generator

# pip install "openai>=1.60"   (quote it — unquoted, the shell treats > as a redirect)
# ⚠️ Last tested: 2026-04 | Requires: openai>=1.60, OPENAI_API_KEY env var
from openai import OpenAI
import json

client = OpenAI()

def generate_instruction_dataset(
    domain: str,
    num_examples: int = 10,
    task_types: list[str] | None = None,
) -> list[dict]:
    """Generate synthetic instruction-response pairs for SFT fine-tuning."""
    if task_types is None:
        task_types = ["explain", "summarize", "compare", "give example", "list steps for"]

    prompt = (
        f"Generate {num_examples} diverse instruction-response pairs for the domain: '{domain}'.\n"
        f"Use these task types: {', '.join(task_types)}.\n"
        "Each pair should be high-quality and distinct from the others.\n\n"
        "Return a JSON object with this exact shape:\n"
        '{"examples": [{"instruction": "...", "response": "...", "task_type": "..."}, ...]}'
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9,          # high diversity
        max_tokens=3000,
        response_format={"type": "json_object"},
    )
    # json_object mode guarantees valid JSON but not the key name,
    # so check common alternatives before falling back to the first value
    raw = json.loads(resp.choices[0].message.content)
    for key in ("examples", "items", "pairs", "data"):
        if key in raw:
            return raw[key]
    return list(raw.values())[0] if raw else []

# Generate RAG training data
examples = generate_instruction_dataset(
    domain="Retrieval-Augmented Generation for enterprise software",
    num_examples=5,
)
for ex in examples[:3]:
    print(f"[{ex.get('task_type', 'N/A')}] {ex['instruction'][:60]}...")
    print(f"  → {ex['response'][:80]}...\n")

# Save for fine-tuning
with open("synthetic_sft_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
print(f"Saved {len(examples)} examples to synthetic_sft_data.jsonl")

★ Connections

  Relationship    Topics
  ────────────    ─────────────────────────────────────────────────
  Builds on       Fine Tuning, Tokenization
  Leads to        Distillation And Compression, Better models
  Compare with    Traditional ML data pipelines, Human annotation
  Cross-domain    Data engineering, ETL pipelines, Data quality

◆ Production Failure Modes

  Model collapse
    Symptoms:    Synthetic-data-trained model produces repetitive outputs
    Root cause:  Training on data from the same model family
    Mitigation:  Mix synthetic with real data (20%+ real); use diverse generators

  Distribution mismatch
    Symptoms:    Model trained on synthetic data fails on real inputs
    Root cause:  Synthetic data doesn't match the production distribution
    Mitigation:  Validate against real-data statistics; domain-specific generators

  Quality amplification
    Symptoms:    Errors in seed data get amplified through the pipeline
    Root cause:  No quality filtering on generated data
    Mitigation:  Multi-stage quality filtering; LLM-as-judge scoring

  PII leakage
    Symptoms:    Generated data contains memorized PII from training
    Root cause:  Large models memorize training examples
    Mitigation:  Differential privacy; PII detection on output; canary tokens
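A first-pass PII filter on generated data is often regex-based. The patterns below are deliberately simple illustrations — production systems layer NER models, validation, and canary tokens on top:

```python
import re

# Illustrative first-pass patterns; more specific patterns (SSN) run before
# broader ones (PHONE) so each match gets the most precise label.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def scrub_pii(text: str) -> str:
    """Replace anything matching a PII pattern with a typed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```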

◆ Hands-On Exercises

Exercise 1: Generate and Validate a Synthetic Dataset

Goal: Create synthetic training data and measure its quality vs. real data
Time: 45 minutes
Steps:
  1. Take 50 real examples from a classification dataset
  2. Use an LLM to generate 200 synthetic examples matching the distribution
  3. Train a classifier on synthetic-only vs real-only vs mixed data
  4. Compare test accuracy across all three
Expected Output: Accuracy comparison table showing mixed data performs best


◆ Recommended Resources

  Type         Resource                                               Why
  ───────────  ─────────────────────────────────────────────────────  ─────────────────────────────────────────────────────
  📄 Paper     Peng et al., "Instruction Tuning with GPT-4" (2023)    Foundational approach to synthetic instruction data
  📘 Book      "AI Engineering" by Chip Huyen (2025), Ch. 4           Covers synthetic data for evaluation and training
  🔧 Hands-on  Argilla documentation                                  Platform for data labeling and synthetic data curation

★ Sources

  • Wang et al., "Self-Instruct" (2023)
  • Xu et al., "WizardLM: Empowering Large Language Models to Follow Complex Instructions" (2023)
  • Microsoft, "Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone" (2024)
  • Shumailov et al., "The Curse of Recursion: Training on Generated Data Makes Models Forget" (2024)
  • HuggingFace Datasets — https://huggingface.co/datasets