ML Experiment & Data Management¶
✨ Bit: If you can't answer "which config and data produced this result?" you're not experimenting; you're guessing. Experiment tracking and data versioning are the two halves of reproducibility: versioning code without versioning data captures only half the picture.
★ TL;DR¶
- What: The practices and tools for recording ML runs (parameters, metrics, artifacts) and tracking dataset changes (versions, lineage, snapshots)
- Why: ML behavior changes when either code OR data changes. Without tracking both, you cannot reproduce results, debug regressions, or govern models.
- Key point: Track not just the best run, but enough context to explain why it was better — including the exact data version behind it.
★ Overview¶
Definition¶
Experiment tracking is the structured logging of model runs and metadata for reproducibility and comparison. Data versioning is tracking the specific data used in every experiment, enabling exact reproduction of results.
Scope¶
Covers: Experiment tracking discipline, data versioning strategies, tool selection (MLflow, W&B, DVC, LakeFS), GenAI-specific considerations (prompt versions, eval sets, RAG corpora), and production-ready code examples.
Significance¶
- Strong tracking accelerates iteration and debugging
- ML governance and audit require lineage from data → model → deployment
- Data drift and silent dataset changes are common causes of production regressions
- Essential for promotion from notebook work to team-level ML engineering
Prerequisites¶
★ Deep Dive¶
What to Track¶
| Category | Items | Examples |
|---|---|---|
| Parameters | Run config | learning rate, batch size, prompt version, model name |
| Metrics | Quality & cost signals | loss, accuracy, eval score, latency, token cost |
| Artifacts | Outputs | model checkpoints, plots, generated outputs |
| Environment | Reproducibility context | code version (git SHA), dependency set, hardware |
| Data lineage | Data state | dataset version, splits, preprocessing steps |
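Taken together, these categories amount to a single run record that can be written alongside every experiment. A minimal, tool-agnostic sketch of such a record (the field names, example values, and output path are illustrative, not a standard schema):

```python
# Sketch of a run record covering the categories above; values are placeholders.
import json
import subprocess
from dataclasses import dataclass, field, asdict

@dataclass
class RunRecord:
    params: dict        # learning rate, prompt version, model name, ...
    metrics: dict       # loss, eval score, latency, token cost, ...
    artifacts: list     # paths to checkpoints, plots, generated outputs
    data_version: str   # dataset snapshot ID or content hash
    code_version: str = field(default_factory=lambda: subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True).strip())

record = RunRecord(
    params={"chunk_size": 400, "prompt_version": "v7"},
    metrics={"recall_at_5": 0.72, "avg_latency_ms": 45},
    artifacts=["configs/rag_config.yaml"],
    data_version="eval_set_v3",
)
with open("run_record.json", "w") as f:
    json.dump(asdict(record), f, indent=2)
```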
What to Version (Data)¶
| Item | Examples | Why |
|---|---|---|
| Raw data snapshot | Ingestion state, source identifiers | Reproduce from source |
| Processed dataset | Cleaned, transformed training table | Training input reproducibility |
| Splits | Train/val/test partition | Fair comparison across experiments |
| Evaluation sets | Gold prompts, edge-case examples | Regression detection |
| GenAI-specific | Retrieved corpora, prompt-eval sets, preference data, synthetic datasets | LLM behavior depends on all of these |
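A lightweight way to make any of these items addressable is a content-hash manifest kept next to the data. A minimal sketch, assuming a local dataset directory (the paths and manifest filename are illustrative):

```python
# Sketch: content-addressed manifest for a dataset directory.
import hashlib
import json
from pathlib import Path

def dataset_manifest(data_dir: str) -> dict:
    """Map each file to its SHA-256 digest so any later change is detectable."""
    return {
        str(p): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path(data_dir).rglob("*"))
        if p.is_file()
    }

manifest = dataset_manifest("data/eval_set_v3")
Path("data/eval_set_v3.manifest.json").write_text(json.dumps(manifest, indent=2))
```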
Why This Matters More in GenAI¶
GenAI teams must track more kinds of artifacts than classical ML teams do:
- Prompt versions: Different wording changes behavior dramatically
- Eval set versions: The benchmark defines what "better" means
- RAG corpus snapshots: Re-indexing changes retrieval results
- Preference/feedback datasets: RLHF training data evolves
- Model routing configs: Which model handles which queries
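One way to keep these GenAI artifacts replayable is to pin them together as run parameters, so a single record answers "which prompt, which eval set, which index?". A hedged sketch (the file names, version labels, and keys are assumptions, not a standard):

```python
# Sketch: pin prompt text, eval set, and RAG index together for one run.
import hashlib
from pathlib import Path

prompt_text = Path("prompts/answer_with_citations.txt").read_text()

run_context = {
    "prompt_file": "prompts/answer_with_citations.txt",
    "prompt_hash": hashlib.sha256(prompt_text.encode()).hexdigest()[:12],  # changes with any rewording
    "eval_set_version": "eval_set_v3",                 # the benchmark that defines "better"
    "rag_index_snapshot": "corpus_2024_06_chunks400",  # re-indexing changes retrieval
    "router_config": "routing_v2.yaml",                # which model handles which queries
}
print(run_context)
```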
Tool Landscape¶
| Tool | Experiment Tracking | Data Versioning | Best For |
|---|---|---|---|
| MLflow | ✅ | Partial (model registry) | Open-source, self-hosted |
| Weights & Biases | ✅ | Partial (artifacts) | Team collaboration, visualization |
| DVC | Partial | ✅ | Git-based data versioning |
| LakeFS | ❌ | ✅ | Data lake versioning at scale |
| Neptune.ai | ✅ | Partial | Metadata-rich experiment tracking |
| Vertex AI / SageMaker | ✅ | ✅ | Integrated cloud ML platform |
Data Versioning Strategies¶
| Strategy | Strength | Best For |
|---|---|---|
| Immutable snapshots | Easy reproducibility | Small/medium datasets |
| Object-store versioning | Practical for large files | Cloud-native teams |
| Git-like tools (DVC) | Lineage and branching | ML teams using Git workflows |
| Table formats (Delta/Iceberg) | Schema evolution, time travel | Data engineering teams |
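To illustrate the immutable-snapshot row, a sketch that publishes each processed dataset to a content-hashed path and never overwrites it (the paths and naming scheme are assumptions):

```python
# Sketch: immutable snapshots via content-hashed, write-once paths.
import hashlib
import shutil
from pathlib import Path

def publish_snapshot(src: str, snapshot_root: str = "snapshots") -> Path:
    """Copy a dataset file to a version-stamped path; never mutate an existing one."""
    content = Path(src).read_bytes()
    version = hashlib.sha256(content).hexdigest()[:10]
    dest = Path(snapshot_root) / f"{Path(src).stem}_{version}{Path(src).suffix}"
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
    return dest

print(publish_snapshot("data/processed/train.parquet"))
```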
★ Code & Implementation¶
MLflow Experiment Tracking¶
```python
# pip install mlflow>=2.10
# ⚠️ Last tested: 2026-04 | Requires: mlflow>=2.10
import mlflow

# Start tracking an experiment
mlflow.set_experiment("rag-retrieval-optimization")

with mlflow.start_run(run_name="bge-small-baseline"):
    # Log parameters
    mlflow.log_param("embedding_model", "BAAI/bge-small-en-v1.5")
    mlflow.log_param("chunk_size", 400)
    mlflow.log_param("chunk_overlap", 50)
    mlflow.log_param("top_k", 5)
    mlflow.log_param("dataset_version", "eval_set_v3")

    # Log metrics
    mlflow.log_metric("recall_at_5", 0.72)
    mlflow.log_metric("precision_at_5", 0.48)
    mlflow.log_metric("mrr", 0.65)
    mlflow.log_metric("avg_latency_ms", 45)

    # Log artifacts
    mlflow.log_artifact("configs/rag_config.yaml")

    print(f"Run ID: {mlflow.active_run().info.run_id}")

# Compare runs programmatically (for CI/CD pipelines)
experiment = mlflow.get_experiment_by_name("rag-retrieval-optimization")
runs = mlflow.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["metrics.mrr DESC"],
    max_results=5,
)
print(f"Best MRR: {runs.iloc[0]['metrics.mrr']:.3f} (run: {runs.iloc[0]['run_id'][:8]})")

# Model promotion gate: only promote if MRR improved
best_mrr = runs.iloc[0]["metrics.mrr"]
if best_mrr > 0.70:
    print(f"PROMOTE: MRR {best_mrr:.3f} exceeds threshold 0.70")
else:
    print(f"HOLD: MRR {best_mrr:.3f} below threshold 0.70")

# Expected output: Run tracked with full reproducibility, comparison across 5 runs
```
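The example above logs parameters, metrics, and artifacts but not code version or environment, which the checklist below also requires. A hedged sketch of a helper that fills that gap inside an active run (the tag names are conventions, not an MLflow requirement):

```python
# Sketch: capture code version and environment context for the current run.
import subprocess
import sys
from pathlib import Path
import mlflow

def log_environment() -> None:
    """Call inside an active MLflow run to record reproducibility context."""
    git_sha = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    mlflow.set_tag("git_sha", git_sha)
    mlflow.set_tag("python_version", sys.version.split()[0])

    # Freeze the dependency set and attach it as an artifact
    frozen = subprocess.check_output([sys.executable, "-m", "pip", "freeze"], text=True)
    Path("requirements_frozen.txt").write_text(frozen)
    mlflow.log_artifact("requirements_frozen.txt")
```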
DVC Data Versioning¶
```bash
# Install: pip install "dvc[s3]"
# ⚠️ Last tested: 2026-04

# Initialize DVC in your git repo
dvc init

# Track a dataset
dvc add data/eval_set_v3.jsonl
git add data/eval_set_v3.jsonl.dvc .gitignore
git commit -m "Track eval set v3"

# Push data to remote storage
dvc remote add -d myremote s3://my-bucket/dvc-store
dvc push

# Later: reproduce the exact dataset state from any git commit
git checkout abc123   # check out the commit
dvc pull              # pull the exact data version used in that commit
```
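To tie the DVC-tracked dataset back to an experiment run, the DVC Python API can resolve the storage location of the file at the current revision. A sketch assuming both the dvc package and MLflow are installed (the run and parameter names are illustrative):

```python
# Sketch: record the exact DVC-tracked data version inside an MLflow run.
import dvc.api
import mlflow

# Resolve where the tracked file lives in remote storage at the current revision
data_url = dvc.api.get_url("data/eval_set_v3.jsonl")

with mlflow.start_run(run_name="bge-small-baseline"):
    mlflow.log_param("dataset_path", "data/eval_set_v3.jsonl")
    mlflow.log_param("dataset_remote_url", data_url)  # pins the content-addressed location
```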
◆ Quick Reference¶
EXPERIMENT TRACKING CHECKLIST:
Every run MUST record:
✓ Parameters (config, hyperparams, prompt version)
✓ Metrics (quality, cost, latency)
✓ Data version (dataset ID, split, snapshot hash)
✓ Code version (git SHA)
✓ Environment (dependencies, hardware)
✓ Artifacts (model, outputs, plots)
DATA VERSIONING CHECKLIST:
✓ Assign stable dataset identifiers
✓ Version evaluation sets just like code
✓ Record exact snapshot used by every important run
✓ Keep schema and provenance metadata with the data
✓ Make rollback and replay possible
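A small sketch of checksum validation against a recorded snapshot, supporting the "record exact snapshot" and "rollback and replay" items above (the expected digest is a placeholder, not a real value):

```python
# Sketch: refuse to run if the data on disk no longer matches the recorded hash.
import hashlib
from pathlib import Path

EXPECTED = {"data/eval_set_v3.jsonl": "9f2c41d0a7b3"}  # placeholder digest from the run record

def verify_snapshot(path: str, expected_prefix: str) -> None:
    actual = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    if not actual.startswith(expected_prefix):
        raise RuntimeError(f"{path} changed since it was recorded (now {actual[:12]})")

for path, digest in EXPECTED.items():
    verify_snapshot(path, digest)
```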
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Unreproducible results | Can't recreate a previous model's behavior | Data version not recorded with experiment | Always log dataset version with every run |
| Silent data drift | Model quality degrades over time | Training or eval data changed without explicit versioning | Immutable snapshots, checksum validation, drift detection |
| Tracking decay | Team stops logging experiments after initial enthusiasm | Manual tracking is tedious | Automate logging (decorators, callbacks), make it zero-effort |
| Eval set contamination | Model performs well on benchmarks but poorly in production | Eval set not versioned separately, leaked into training | Strict eval set versioning, separate storage, access controls |
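The "tracking decay" mitigation above can be made concrete with a decorator that logs parameters and returned metrics automatically, so tracking costs the author nothing per experiment. A minimal sketch assuming MLflow (the decorator and function names are illustrative):

```python
# Sketch: zero-effort tracking via a decorator; names are illustrative.
import functools
import mlflow

def tracked(experiment: str):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(**params):
            mlflow.set_experiment(experiment)
            with mlflow.start_run(run_name=fn.__name__):
                mlflow.log_params(params)
                metrics = fn(**params)   # the wrapped function returns a dict of metrics
                mlflow.log_metrics(metrics)
                return metrics
        return wrapper
    return decorator

@tracked("rag-retrieval-optimization")
def evaluate_retriever(chunk_size: int = 400, top_k: int = 5) -> dict:
    # ... run the real evaluation here; placeholder numbers for the sketch ...
    return {"recall_at_5": 0.72, "mrr": 0.65}

evaluate_retriever(chunk_size=400, top_k=5)
```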
○ Gotchas¶
- Tracking only the best run hides important learning
- Manual metadata entry decays quickly — automate everything
- A dashboard without artifact or data lineage is incomplete
- Naming folders `final_v2_real` is not data versioning
- Eval datasets are often forgotten even though they're critical
○ Interview Angles¶
- Q: What is the minimum metadata you would track for an ML run?
- A: Parameters, metrics, code version, data version, artifacts, and environment details. Without that set, comparing or reproducing results is unreliable. For GenAI specifically, I'd also track prompt versions, eval set versions, and token costs.
- Q: Why is data versioning essential for ML reproducibility?
- A: Because code alone does not determine model behavior. You need the exact dataset state, splits, and lineage to reproduce or explain results. In GenAI, this extends to RAG corpus snapshots, preference data, and evaluation sets.
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | CI/CD for ML, Cloud ML Services |
| Leads to | Model registry, auditability, reproducible training, LLMOps |
| Compare with | Ad hoc notebook history, code-only versioning |
| Cross-domain | Data engineering, experiment design, governance |
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 🔧 Hands-on | Weights & Biases Documentation | Industry-standard experiment tracking platform |
| 🔧 Hands-on | MLflow Documentation | Open-source experiment tracking and model registry |
| 🔧 Hands-on | DVC Documentation | Git-based data and model versioning |
| 🔧 Hands-on | LakeFS Documentation | Git-like versioning for data lakes |
| 📘 Book | "Designing Machine Learning Systems" by Chip Huyen (2022), Ch 4, 6 | Data management and experiment tracking in ML workflows |
◆ Hands-On Exercises¶
Exercise 1: Set Up End-to-End Experiment Tracking¶
Goal: Track a complete ML experiment with reproducibility
Time: 30 minutes
Steps:
1. Set up an MLflow tracking server locally
2. Train a model with 3 hyperparameter configurations
3. Log params, metrics, artifacts, and environment info for each run
4. Use the MLflow UI to compare runs and select the best
Expected Output: MLflow dashboard showing 3 comparable runs with artifact links
★ Sources¶
- MLflow Documentation — https://mlflow.org/docs/
- Weights & Biases Documentation — https://docs.wandb.ai/
- DVC Documentation — https://dvc.org/doc
- LakeFS Documentation — https://docs.lakefs.io/
- CI/CD for ML