Knowledge Distillation & Model Compression¶
✨ Bit: GPT-4 knows a lot, but it's enormous and expensive. Distillation is like a PhD student learning from a professor — the student ends up much smaller but captures most of the professor's knowledge. That's how Phi-3 (3.8B) can compete with models 100x its size.
★ TL;DR¶
- What: Techniques to create smaller, faster, cheaper models that retain the capabilities of larger ones
- Why: You can't run GPT-4 on a phone. But you CAN distill its knowledge into a 7B model that runs anywhere.
- Key point: Distillation transfers "dark knowledge" (soft probabilities, reasoning patterns) from teacher to student, not just the final answers. This produces students far better than same-size models trained from scratch.
★ Overview¶
Definition¶
Knowledge Distillation (KD): Training a small "student" model to mimic the behavior of a large "teacher" model. The student learns from the teacher's softmax distribution (soft labels) rather than just hard labels.
Model Compression: The umbrella term for making models smaller/faster, including distillation, pruning, quantization, and architecture changes.
Scope¶
Covers distillation and pruning. For quantization (INT4/INT8/FP8), see Inference Optimization. For fine-tuning (LoRA/QLoRA), see Fine Tuning.
Significance¶
- How DeepSeek-R1-Distill-Qwen-32B and Phi-3 models are created
- Critical for edge deployment (mobile, IoT, embedded)
- Reduces inference costs 10-100x while retaining 80-95% quality
- Hot interview topic: "How would you deploy an LLM on device?"
★ Deep Dive¶
The Distillation Framework¶
┌─────────────────────────────────────────────────┐
│ KNOWLEDGE DISTILLATION │
│ │
│ TEACHER (large, expensive) │
│ ┌──────────────────────┐ │
│ │ GPT-4 / R1 / 70B │ │
│ │ Input: "What is AI?"│ │
│ │ Output distribution:│ │
│ │ AI: 0.35 │ ← "Soft labels" │
│ │ ML: 0.25 │ Rich information! │
│ │ robot: 0.15 │ "AI and ML are │
│ │ code: 0.10 │ related" is encoded │
│ │ other: 0.15 │ in these probs. │
│ └──────────┬───────────┘ │
│ │ soft probabilities │
│ ▼ │
│ STUDENT (small, efficient) │
│ ┌──────────────────────┐ │
│ │ 7B / 3B / 1B model │ │
│ │ Learns to match the │ │
│ │ teacher's soft │ │
│ │ distribution, not │ │
│ │ just the right │ │
│ │ answer │ │
│ └──────────────────────┘ │
│ │
│ LOSS = α × KL(teacher_soft, student_soft) │
│ + (1-α) × CrossEntropy(student, labels) │
│ │
│ Temperature T → softens distributions │
│ Higher T → more "dark knowledge" transfer │
└─────────────────────────────────────────────────┘
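A minimal sketch of the temperature effect described above, assuming made-up logits for a five-token vocabulary: the same logits give a sharp distribution at T=1 and a much softer one at T=4, which is where the extra "dark knowledge" signal comes from.
# Sketch: temperature scaling on illustrative logits (values are made up)
import torch
import torch.nn.functional as F

logits = torch.tensor([4.0, 3.5, 3.0, 2.5, 2.8])  # e.g. AI, ML, robot, code, other

for T in (1.0, 4.0):
    probs = F.softmax(logits / T, dim=-1)
    print(f"T={T}: {[round(p, 3) for p in probs.tolist()]}")
# T=1.0 -> the top token dominates; T=4.0 -> probability mass spreads out,
# exposing how close the runner-up tokens are to the winner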
Types of Distillation¶
| Type | How It Works | Example |
|---|---|---|
| Response-based | Student mimics teacher's output distribution | Classic: soft label matching |
| Feature-based | Student mimics teacher's intermediate representations | Match hidden layer activations |
| Relation-based | Student learns relationships between samples | Contrastive distillation |
| Rationale-based | Teacher generates step-by-step reasoning as training data | DeepSeek-R1 → R1-Distill-Qwen |
| Multi-teacher | Multiple teachers guide one student | Ensemble knowledge transfer |
| Self-distillation | Model teaches itself (larger layers → smaller) | Born-again networks |
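For the feature-based row above, here is a minimal sketch of what "match hidden layer activations" can look like in practice; the dimensions, layer pairing, and learned projection are illustrative assumptions, not a specific library API.
# Sketch of feature-based distillation: align one student hidden layer with one
# teacher hidden layer via a learned projection (dims and pairing are illustrative)
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher_dim, student_dim = 4096, 2048
proj = nn.Linear(student_dim, teacher_dim, bias=False)  # trained with the student

def feature_distill_loss(student_hidden: torch.Tensor,
                         teacher_hidden: torch.Tensor) -> torch.Tensor:
    """MSE between projected student activations and frozen teacher activations."""
    return F.mse_loss(proj(student_hidden), teacher_hidden.detach())

# Hidden states of shape (batch, seq, dim)
student_h = torch.randn(2, 16, student_dim)
teacher_h = torch.randn(2, 16, teacher_dim)
print(feature_distill_loss(student_h, teacher_h).item())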
Rationale Distillation (Modern LLM Pattern)¶
The most common pattern in 2025-2026:
1. Teacher generates high-quality outputs
Teacher (R1): "Let me think... [reasoning chain] ... Answer: 42"
2. Collect outputs as training data
dataset = [(prompt, teacher(prompt)) for prompt in prompts]
3. Fine-tune student on teacher's outputs
Student (7B) trained on (input, reasoning + answer) pairs
This is how:
DeepSeek-R1 → R1-Distill-Qwen-14B, R1-Distill-Llama-70B
GPT-3.5 (text-davinci-003 / ChatGPT) → Alpaca/Vicuna (early 2023, simpler version)
GPT-4 → Phi-3 (via synthetic data distillation)
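A rough sketch of steps 1-3 as a data pipeline. teacher_generate is a hypothetical placeholder for however you query the teacher (API or local model), and the JSONL schema is just a common supervised fine-tuning convention, not a specific toolkit's format.
# Sketch of rationale distillation data collection
import json

def teacher_generate(prompt: str) -> str:
    # Placeholder: in practice, call the teacher (e.g. an R1-style model or API)
    # and return its full reasoning chain plus final answer as one string
    return "Let me think step by step... [reasoning] ... Answer: 42"

def build_distillation_dataset(prompts: list[str], out_path: str) -> None:
    with open(out_path, "w") as f:
        for prompt in prompts:
            completion = teacher_generate(prompt)  # reasoning chain + answer
            f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")

build_distillation_dataset(["What is 6 * 7?"], "distill_data.jsonl")
# The resulting pairs are then used for ordinary supervised fine-tuning of the student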
Other Compression Techniques¶
PRUNING: Remove unimportant weights/neurons/layers
Before: ●─●─●─●─● (all connections active)
After: ●─ ─●─ ─● (weak connections removed)
Types:
- Unstructured: Remove individual weights (sparse matrix)
- Structured: Remove entire neurons/heads/layers (faster)
- Width pruning: Fewer neurons per layer
- Depth pruning: Fewer layers (layer dropping)
Results: 20-50% size reduction with <5% quality loss
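A minimal sketch of unstructured magnitude pruning using PyTorch's built-in pruning utilities; the layer size and the 30% sparsity target are arbitrary choices for illustration.
# Sketch: unstructured L1 (magnitude) pruning with torch.nn.utils.prune
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(1024, 1024)

# Zero out the 30% of weights with the smallest absolute value
prune.l1_unstructured(layer, name="weight", amount=0.3)
print(f"Sparsity: {(layer.weight == 0).float().mean().item():.2%}")

# Fold the pruning mask back into the weight tensor permanently
prune.remove(layer, "weight")
# Note: unstructured sparsity only speeds up inference with sparse-aware kernels;
# structured pruning (whole neurons/heads/layers) is what yields dense-hardware speedups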
QUANTIZATION: Reduce number precision
(covered in detail in [Inference Optimization](../inference/inference-optimization.md))
FP32 → FP16 → INT8 → INT4
Each step: ~2x smaller, slight quality trade-off
ARCHITECTURE CHANGES:
- Replace attention with more efficient variants
- Reduce hidden dimensions
- Fewer layers
- Smaller vocabulary
Compression Comparison¶
| Technique | Size Reduction | Speed Gain | Quality Loss | Effort |
|---|---|---|---|---|
| Distillation | 10-50x | 10-50x | 5-20% | High (need teacher data) |
| Pruning | 2-5x | 2-3x | 2-10% | Medium |
| Quantization (INT4) | 4x | 2-3x | 1-5% | Low (post-hoc) |
| LoRA/QLoRA | ~same size | ~same | Tuned for task | Low |
| Combined | 50-200x | 20-100x | 10-25% | High |
Real-World Distillation Examples¶
| Teacher | Student | Size Ratio | Quality Retained |
|---|---|---|---|
| DeepSeek-R1 (671B) | R1-Distill-Qwen-32B | 21x smaller | ~85-90% on reasoning |
| DeepSeek-R1 (671B) | R1-Distill-Qwen-7B | 96x smaller | ~70-80% on reasoning |
| GPT-4 (1.8T est.) | Phi-3 (3.8B) | ~470x smaller | ~75-85% on benchmarks |
| Claude/GPT-4 | Orca-2 (13B) | ~140x smaller | Strong step-by-step reasoning |
◆ Quick Reference¶
DISTILLATION DECISION TREE:
Need to deploy on edge/mobile?
→ Quantize (INT4) + distill to small model
Need reasoning capability in small model?
→ Rationale distillation from o1/R1
Need domain-specific small model?
→ Fine-tune small model on teacher-generated domain data
Need fastest possible inference?
→ Distill + quantize + prune (all three)
KEY INSIGHT:
Distillation ≠ just fine-tuning on outputs.
The soft probability distribution contains MORE information
than hard labels. "AI" at 0.35 and "ML" at 0.25 tells the
student that AI and ML are related. Hard label "AI" doesn't.
○ Gotchas & Common Mistakes¶
- ⚠️ Distilling from API outputs may violate ToS: OpenAI/Anthropic prohibit using their outputs to train competing models. Check terms.
- ⚠️ Not everything transfers: Distillation works best for surface knowledge. Deep reasoning and broad world knowledge are harder to transfer.
- ⚠️ Model collapse risk: Repeated distillation (distilling distilled models) degrades quality. Use the original teacher.
- ⚠️ Temperature matters: Too low T → student only learns top predictions. Too high T → noise. T=2-4 is typical.
○ Interview Angles¶
- Q: How does knowledge distillation work?
- A: A large "teacher" model's soft probability outputs (including relationships between classes) are used as training targets for a smaller "student" model. The student learns to match the teacher's full output distribution using KL divergence loss, not just the correct answer. This transfers "dark knowledge": the teacher's implicit understanding of which concepts are similar.
- Q: How is DeepSeek-R1-Distill created?
- A: DeepSeek-R1 (671B MoE) generates reasoning chains for thousands of problems. These (input, reasoning_chain + answer) pairs become fine-tuning data for smaller models like Qwen-14B. The small model literally learns to REASON like R1 by mimicking its step-by-step thinking.
★ Code & Implementation¶
Knowledge Distillation: Teacher → Student Loss¶
# pip install torch>=2.3 transformers>=4.40
# ⚠️ Last tested: 2026-04 | Requires: torch>=2.3
import torch
import torch.nn as nn
import torch.nn.functional as F
def distillation_loss(
    student_logits: torch.Tensor,
    teacher_logits: torch.Tensor,
    labels: torch.Tensor,
    temperature: float = 4.0,
    alpha: float = 0.7,
) -> torch.Tensor:
    """
    Combined distillation + cross-entropy loss.
    student_logits: (batch, seq, vocab)
    teacher_logits: (batch, seq, vocab)
    labels: (batch, seq) ground-truth token IDs
    temperature: softens the teacher distribution (higher = more information)
    alpha: weight of the distillation loss (1 - alpha = CE weight)
    """
    vocab_size = student_logits.size(-1)
    # Soft targets from teacher (temperature scaling); flatten to (batch*seq, vocab)
    # so that "batchmean" averages the KL divergence per token
    soft_teacher = F.softmax(teacher_logits.view(-1, vocab_size) / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits.view(-1, vocab_size) / temperature, dim=-1)
    kd_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (temperature ** 2)
    # Hard targets from ground-truth labels
    ce_loss = F.cross_entropy(
        student_logits.view(-1, vocab_size),
        labels.view(-1),
        ignore_index=-100,
    )
    return alpha * kd_loss + (1 - alpha) * ce_loss
# Example shapes (tiny vocab for demo)
batch, seq, vocab = 2, 10, 100
student_logits = torch.randn(batch, seq, vocab)
teacher_logits = torch.randn(batch, seq, vocab)
labels = torch.randint(0, vocab, (batch, seq))
loss = distillation_loss(student_logits, teacher_logits, labels)
print(f"Distillation loss: {loss.item():.4f}")
# GGUF quantization check (inference only; requires the llama-cpp-python package)
# After downloading a GGUF model:
# from llama_cpp import Llama
# llm = Llama(model_path="./model.gguf", n_ctx=2048)
# output = llm("Explain LoRA in one sentence.", max_tokens=80)
# print(output["choices"][0]["text"])
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | Fine Tuning, Deep Learning Fundamentals |
| Leads to | Edge AI deployment, Inference Optimization |
| Compare with | Quantization (number precision), Pruning (removing weights) |
| Cross-domain | Transfer learning, Curriculum learning |
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Capability cliff | Student model loses specific capabilities while matching aggregate metrics | Distillation data doesn't cover edge cases | Targeted distillation on weak subsets, multi-task distillation |
| Quantization outliers | Quality drops sharply at INT4/INT8 | Activation outliers in certain layers | SmoothQuant, GPTQ per-channel quantization, mixed precision |
| Pruning instability | Structured pruning removes critical attention heads | No importance scoring before pruning | Magnitude + gradient importance scores, iterative pruning |
| Format mismatch | Distilled model can't follow complex instructions | Training data focused on short completions | Include instruction-following examples in distillation dataset |
◆ Hands-On Exercises¶
Exercise 1: Quantize and Benchmark at Multiple Precisions¶
Goal: Quantize a model from FP16 to INT8 to INT4 and benchmark
Time: 30 minutes
Steps:
1. Load a base model in FP16
2. Quantize with bitsandbytes (8-bit, then 4-bit)
3. Run a standard benchmark at each precision
4. Measure inference speed at each level
Expected Output: Quality/speed trade-off chart across precisions
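A starting-point sketch for steps 1-2, assuming the transformers + bitsandbytes stack and a CUDA GPU; MODEL_ID is an example small model, so substitute whichever base model you want to benchmark.
# Sketch for Exercise 1: load the same model in 8-bit and 4-bit with bitsandbytes
# Assumes: pip install transformers accelerate bitsandbytes, and a CUDA GPU
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

MODEL_ID = "Qwen/Qwen2.5-0.5B"  # example; substitute your own base model

def load_quantized(bits: int):
    if bits == 8:
        cfg = BitsAndBytesConfig(load_in_8bit=True)
    else:  # 4-bit NF4 with bf16 compute, as used in QLoRA-style setups
        cfg = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
    return AutoModelForCausalLM.from_pretrained(
        MODEL_ID, quantization_config=cfg, device_map="auto"
    )

for bits in (8, 4):
    model = load_quantized(bits)
    print(f"{bits}-bit footprint: {model.get_memory_footprint() / 1e9:.2f} GB")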
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 📄 Paper | Hinton et al. "Distilling the Knowledge in a Neural Network" (2015) | The foundational knowledge distillation paper |
| 📄 Paper | Frantar et al. "GPTQ" (2022) | Post-training quantization for large models |
| 📘 Book | "Efficient Deep Learning" by Menghani (2024) | Comprehensive treatment of compression techniques |
★ Sources¶
- Hinton et al., "Distilling the Knowledge in a Neural Network" (2015) — the original paper
- DeepSeek, "DeepSeek-R1 Distilled Models" (2025)
- Microsoft, "Phi-3 Technical Report" (2024)
- Gou et al., "Knowledge Distillation: A Survey" (2021)