Knowledge Distillation & Model Compression

Bit: GPT-4 knows a lot, but it's enormous and expensive. Distillation is like a PhD student learning from a professor — the student ends up much smaller but captures most of the professor's knowledge. That's how Phi-3 (3.8B) can compete with models 100x its size.


★ TL;DR

  • What: Techniques to create smaller, faster, cheaper models that retain the capabilities of larger ones
  • Why: You can't run GPT-4 on a phone. But you CAN distill its knowledge into a 7B model that runs anywhere.
  • Key point: Distillation transfers "dark knowledge" (soft probabilities, reasoning patterns) from teacher to student, not just the final answers. This produces students far better than training from scratch.

★ Overview

Definition

Knowledge Distillation (KD): Training a small "student" model to mimic the behavior of a large "teacher" model. The student learns from the teacher's softmax distribution (soft labels) rather than just hard labels.

Model Compression: The umbrella term for making models smaller/faster, including distillation, pruning, quantization, and architecture changes.

Scope

Covers distillation and pruning. For quantization (INT4/INT8/FP8), see Inference Optimization. For fine-tuning (LoRA/QLoRA), see Fine Tuning.

Significance

  • How DeepSeek-R1-Distill-Qwen-32B and Phi-3 models are created
  • Critical for edge deployment (mobile, IoT, embedded)
  • Can reduce inference costs by roughly 10-100x while retaining about 80-95% of the teacher's quality, depending on task and student size
  • Hot interview topic: "How would you deploy an LLM on device?"

★ Deep Dive

The Distillation Framework

┌─────────────────────────────────────────────────┐
│            KNOWLEDGE DISTILLATION                │
│                                                 │
│  TEACHER (large, expensive)                      │
│  ┌──────────────────────┐                        │
│  │  GPT-4 / R1 / 70B   │                        │
│  │  Input: "What is AI?"│                        │
│  │  Output distribution:│                        │
│  │    AI:     0.35      │ ← "Soft labels"        │
│  │    ML:     0.25      │    Rich information!    │
│  │    robot:  0.15      │    "AI and ML are       │
│  │    code:   0.10      │     related" is encoded │
│  │    other:  0.15      │     in these probs.     │
│  └──────────┬───────────┘                        │
│             │ soft probabilities                  │
│             ▼                                    │
│  STUDENT (small, efficient)                      │
│  ┌──────────────────────┐                        │
│  │  7B / 3B / 1B model  │                        │
│  │  Learns to match the │                        │
│  │  teacher's soft       │                        │
│  │  distribution, not   │                        │
│  │  just the right      │                        │
│  │  answer              │                        │
│  └──────────────────────┘                        │
│                                                 │
│  LOSS = α × KL(teacher_soft, student_soft)      │
│       + (1-α) × CrossEntropy(student, labels)   │
│                                                 │
│  Temperature T → softens distributions          │
│  Higher T → more "dark knowledge" transfer      │
└─────────────────────────────────────────────────┘
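
To make the temperature line in the diagram concrete, here is a minimal sketch of temperature-scaled softmax in plain Python. The logits are made-up values chosen only for illustration; they are not taken from any real teacher model.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for the tokens: AI, ML, robot, code, other
teacher_logits = [2.0, 1.5, 0.5, 0.0, 0.5]

for t in (1.0, 2.0, 4.0):
    probs = softmax_with_temperature(teacher_logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.2f}" for p in probs))
```

As T grows, the top token's probability shrinks and the low-probability tokens gain mass, which is what gives the student more signal about the non-top tokens.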

Types of Distillation

| Type | How It Works | Example |
|---|---|---|
| Response-based | Student mimics teacher's output distribution | Classic: soft label matching |
| Feature-based | Student mimics teacher's intermediate representations | Match hidden layer activations |
| Relation-based | Student learns relationships between samples | Contrastive distillation |
| Rationale-based | Teacher generates step-by-step reasoning as training data | DeepSeek-R1 → R1-Distill-Qwen |
| Multi-teacher | Multiple teachers guide one student | Ensemble knowledge transfer |
| Self-distillation | A model serves as its own teacher (same architecture across generations, or deeper layers supervising shallower ones) | Born-again networks |

Rationale Distillation (Modern LLM Pattern)

The most common pattern in 2025-2026:

  1. Teacher generates high-quality outputs
     Teacher (R1): "Let me think... [reasoning chain] ... Answer: 42"

  2. Collect outputs as training data
     dataset = [(input, teacher_output) for input in prompts]

  3. Fine-tune student on teacher's outputs
     Student (7B) trained on (input, reasoning + answer) pairs

  This is how:
    DeepSeek-R1 → R1-Distill-Qwen-14B, R1-Distill-Llama-70B
    text-davinci-003 → Alpaca, ChatGPT conversations (ShareGPT) → Vicuna (early 2023, simpler version)
    GPT-4-class teacher (assumed, not officially named) → Phi-3 (via synthetic data distillation)
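
The three steps above can be sketched as a small data-collection script. This is an illustration under stated assumptions: `teacher_generate` is a hypothetical stand-in for a call to the teacher model, the `<think>` tag format and the `keep` filter are invented for the example, and the fine-tuning step itself is not shown.

```python
import json

def teacher_generate(prompt: str) -> str:
    """Hypothetical stand-in for a call to the teacher model."""
    return f"<think>Work through: {prompt}</think> Answer: (teacher's answer)"

def build_sft_dataset(prompts, keep=lambda output: "Answer:" in output):
    """Steps 1-2: query the teacher and keep outputs that pass a filter."""
    records = []
    for prompt in prompts:
        output = teacher_generate(prompt)
        if keep(output):  # e.g. drop truncated or malformed chains
            records.append({"prompt": prompt, "completion": output})
    return records

prompts = ["What is 6 * 7?", "Is 91 prime?"]
dataset = build_sft_dataset(prompts)

# Step 3 would fine-tune the student on these records (not shown).
for record in dataset:
    print(json.dumps(record))
```

In practice the filter is where most of the quality control tends to happen (for example, rejecting chains whose final answer is wrong), so it is kept as a replaceable argument here.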

Other Compression Techniques

PRUNING: Remove unimportant weights/neurons/layers

  Before:  ●─●─●─●─●    (all connections active)
  After:   ●─ ─●─ ─●    (weak connections removed)

  Types:
  - Unstructured: Remove individual weights (sparse matrix)
  - Structured:   Remove entire neurons/heads/layers (faster)
  - Width pruning: Fewer neurons per layer
  - Depth pruning: Fewer layers (layer dropping)

  Results (moderate pruning): 20-50% size reduction with <5% quality loss;
  more aggressive pruning trades away more quality


QUANTIZATION: Reduce number precision
  (covered in detail in [Inference Optimization](../inference/inference-optimization.md))
  FP32 → FP16 → INT8 → INT4
  Each step: ~2x smaller, slight quality trade-off


ARCHITECTURE CHANGES:
  - Replace attention with more efficient variants
  - Reduce hidden dimensions
  - Fewer layers
  - Smaller vocabulary
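
The difference between unstructured and structured pruning can be sketched on a toy weight matrix. This is a minimal magnitude-based illustration in plain Python, assuming "importance" is simply absolute weight size; real pruning methods usually use richer importance scores and retrain afterwards.

```python
def prune_unstructured(weights, sparsity):
    """Zero the smallest-magnitude individual weights (unstructured)."""
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)          # number of weights to drop
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]

def prune_structured(weights, keep_rows):
    """Keep only the rows (neurons) with the largest L1 norm (structured)."""
    ranked = sorted(range(len(weights)),
                    key=lambda i: sum(abs(w) for w in weights[i]),
                    reverse=True)
    kept = sorted(ranked[:keep_rows])      # preserve original row order
    return [weights[i][:] for i in kept]

# Toy 4x3 layer: each row is one neuron's incoming weights.
layer = [
    [0.90, -0.05, 0.40],
    [0.01, 0.02, -0.03],
    [-0.70, 0.60, 0.10],
    [0.20, -0.15, 0.08],
]

sparse = prune_unstructured(layer, sparsity=0.5)
narrow = prune_structured(layer, keep_rows=2)
print("zeros after unstructured pruning:", sum(w == 0.0 for r in sparse for w in r))
print("shape after structured pruning:", len(narrow), "x", len(narrow[0]))
```

The unstructured result keeps its 4x3 shape and only becomes sparse, so it needs sparse kernels to run faster. The structured result is a genuinely smaller dense matrix, which is why structured pruning tends to give speedups on ordinary hardware.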

Compression Comparison

| Technique | Size Reduction | Speed Gain | Quality Loss | Effort |
|---|---|---|---|---|
| Distillation | 10-50x | 10-50x | 5-20% | High (need teacher data) |
| Pruning | 2-5x | 2-3x | 2-10% | Medium |
| Quantization (INT4) | 4x | 2-3x | 1-5% | Low (post-hoc) |
| LoRA/QLoRA | ~same size | ~same | Tuned for task | Low |
| Combined | 50-200x | 20-100x | 10-25% | High |
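
The size-reduction column follows from simple arithmetic on parameter count and bits per weight. The sketch below is a back-of-the-envelope estimate only: it counts weight storage and ignores activations, KV cache, and quantization metadata, and the 70B/7B pairing is a hypothetical example.

```python
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}

def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring overhead."""
    return num_params * BITS_PER_WEIGHT[precision] / 8 / 1e9

for name, params in [("70B teacher", 70e9), ("7B student", 7e9)]:
    for precision in ("FP16", "INT8", "INT4"):
        print(f"{name} @ {precision}: {weight_memory_gb(params, precision):.1f} GB")

# Combined reduction: distill 70B -> 7B (10x), then FP16 -> INT4 (4x)
combined = weight_memory_gb(70e9, "FP16") / weight_memory_gb(7e9, "INT4")
print(f"combined reduction: {combined:.0f}x")
```

The reductions multiply, which is why combining distillation with quantization reaches ratios that neither technique reaches alone.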

Real-World Distillation Examples

| Teacher | Student | Size Ratio | Quality Retained |
|---|---|---|---|
| DeepSeek-R1 (671B) | R1-Distill-Qwen-32B | 21x smaller | ~85-90% on reasoning |
| DeepSeek-R1 (671B) | R1-Distill-Qwen-7B | 96x smaller | ~70-80% on reasoning |
| GPT-4 (1.8T est., unconfirmed) | Phi-3 (3.8B) | ~470x smaller | ~75-85% on benchmarks |
| GPT-4 (1.8T est., unconfirmed) | Orca-2 (13B) | ~140x smaller | Strong step-by-step reasoning |

◆ Quick Reference

DISTILLATION DECISION TREE:
  Need to deploy on edge/mobile?
    → Quantize (INT4) + distill to small model

  Need reasoning capability in small model?
    → Rationale distillation from a reasoning teacher that exposes its chains (e.g., R1)

  Need domain-specific small model?
    → Fine-tune small model on teacher-generated domain data

  Need fastest possible inference?
    → Distill + quantize + prune (all three)

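KEY INSIGHT:
  Classic (logit-based) distillation ≠ just fine-tuning on outputs.
  The soft probability distribution contains MORE information
  than hard labels. "AI" at 0.35 and "ML" at 0.25 tells the
  student that AI and ML are related. Hard label "AI" doesn't.
  (Rationale distillation IS fine-tuning on outputs; it gets its
  extra signal from the reasoning text instead of the logits.)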
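
A small numeric check of this insight, as a sketch: the teacher distribution is the one from the diagram earlier in the document, while the two student distributions are hypothetical values invented for the comparison.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

tokens = ["AI", "ML", "robot", "code", "other"]
teacher = [0.35, 0.25, 0.15, 0.10, 0.15]       # soft label from the diagram

# Two hypothetical students; both pick "AI" as the top token.
student_a = [0.40, 0.25, 0.15, 0.08, 0.12]     # also ranks "ML" second
student_b = [0.40, 0.05, 0.15, 0.28, 0.12]     # thinks "code" is closer than "ML"

def top1(dist):
    return tokens[dist.index(max(dist))]

def hard_label_loss(dist, label="AI"):
    """Cross-entropy against the one-hot label: only the label's probability matters."""
    return -math.log(dist[tokens.index(label)])

print("top-1:", top1(student_a), top1(student_b))
print(f"hard-label CE: A={hard_label_loss(student_a):.4f}  B={hard_label_loss(student_b):.4f}")
print(f"KL(teacher || A) = {kl_divergence(teacher, student_a):.4f}")
print(f"KL(teacher || B) = {kl_divergence(teacher, student_b):.4f}")
```

Both students get the same hard-label loss because they assign the same probability to "AI". Only the KL term against the teacher's soft distribution penalizes student B for ranking "code" above "ML".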

○ Gotchas & Common Mistakes

  • ⚠️ Distilling from API outputs may violate ToS: OpenAI/Anthropic prohibit using their outputs to train competing models. Check terms.
  • ⚠️ Not everything transfers: Distillation works best for surface knowledge. Transferring deep reasoning and broad world knowledge is harder, especially into very small students.
  • ⚠️ Model collapse risk: Repeated distillation (distilling distilled models) degrades quality. Use the original teacher.
  • ⚠️ Temperature matters: Too low T → student only learns top predictions. Too high T → noise. T=2-4 is typical.

○ Interview Angles

  • Q: How does knowledge distillation work?
  • A: A large "teacher" model's soft probability outputs (including relationships between classes) are used as training targets for a smaller "student" model. The student learns to match the teacher's full output distribution using KL divergence loss, not just the correct answer. This transfers "dark knowledge" — the teacher's implicit understanding of which concepts are similar.

  • Q: How is DeepSeek-R1-Distill created?

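  • A: DeepSeek-R1 (671B MoE) generates reasoning chains for a large set of problems (the DeepSeek-R1 paper reports roughly 800k curated samples). These (input, reasoning_chain + answer) pairs become supervised fine-tuning data for smaller base models such as Qwen2.5-14B. The small model learns to reason in R1's style by imitating its step-by-step thinking; the distilled models receive SFT only, with no RL stage.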

★ Code & Implementation

Knowledge Distillation: Teacher → Student Loss

# pip install "torch>=2.3"
# ⚠️ Last tested: 2026-04 | Requires: torch>=2.3
import torch
import torch.nn.functional as F

def distillation_loss(
    student_logits: torch.Tensor,
    teacher_logits: torch.Tensor,
    labels: torch.Tensor,
    temperature: float = 4.0,
    alpha: float = 0.7,
) -> torch.Tensor:
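    """
    Combined distillation + CE loss.

    student_logits: (batch, seq, vocab)
    teacher_logits: (batch, seq, vocab)
    labels:         (batch, seq) ground-truth token IDs; -100 marks ignored positions
    temperature:    softens both distributions (higher = flatter, more weight on non-top tokens)
    alpha:          weight of distillation loss (1-alpha = CE weight)
    """
    vocab = student_logits.size(-1)
    # Flatten to (batch * seq, vocab) so both losses average over tokens.
    # With 3-D inputs, reduction="batchmean" would divide by batch only,
    # making the KD term grow with sequence length.
    student_flat = student_logits.reshape(-1, vocab)
    teacher_flat = teacher_logits.detach().reshape(-1, vocab)  # no gradient into the teacher
    labels_flat = labels.reshape(-1)

    # Score only positions with a real label (skip padding / masked tokens)
    keep = labels_flat != -100

    # Soft targets from teacher (temperature scaling)
    teacher_probs = F.softmax(teacher_flat[keep] / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_flat[keep] / temperature, dim=-1)
    # T^2 compensates for the 1/T^2 gradient scaling introduced by the temperature
    kd_loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (temperature ** 2)

    # Hard targets from ground truth labels
    ce_loss = F.cross_entropy(student_flat, labels_flat, ignore_index=-100)

    return alpha * kd_loss + (1 - alpha) * ce_loss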

# Example shapes (tiny vocab for demo)
batch, seq, vocab = 2, 10, 100
student_logits = torch.randn(batch, seq, vocab)
teacher_logits = torch.randn(batch, seq, vocab)
labels         = torch.randint(0, vocab, (batch, seq))

loss = distillation_loss(student_logits, teacher_logits, labels)
print(f"Distillation loss: {loss.item():.4f}")

# GGUF Quantization check (inference only; requires the llama-cpp-python package)
# After downloading a GGUF model:
# from llama_cpp import Llama
# llm = Llama(model_path="./model.gguf", n_ctx=2048)
# output = llm("Explain LoRA in one sentence.", max_tokens=80)
# print(output["choices"][0]["text"])

★ Connections

| Relationship | Topics |
|---|---|
| Builds on | Fine Tuning, Deep Learning Fundamentals |
| Leads to | Edge AI deployment, Inference Optimization |
| Compare with | Quantization (number precision), Pruning (removing weights) |
| Cross-domain | Transfer learning, Curriculum learning |

◆ Production Failure Modes

| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Capability cliff | Student model loses specific capabilities while matching aggregate metrics | Distillation data doesn't cover edge cases | Targeted distillation on weak subsets, multi-task distillation |
| Quantization outliers | Quality drops sharply at INT4/INT8 | Activation outliers in certain layers | SmoothQuant, GPTQ per-channel quantization, mixed precision |
| Pruning instability | Structured pruning removes critical attention heads | No importance scoring before pruning | Magnitude + gradient importance scores, iterative pruning |
| Format mismatch | Distilled model can't follow complex instructions | Training data focused on short completions | Include instruction-following examples in distillation dataset |
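
The "Quantization outliers" row can be illustrated with a toy example. This sketch uses naive absmax round-to-nearest quantization on made-up values; it shows why a finer-grained scale helps and is not an implementation of SmoothQuant or GPTQ. The table describes activation outliers, while the toy uses a small weight matrix, but the scale-domination effect is the same.

```python
def absmax_scale(values, bits=8):
    """Scale that maps the largest magnitude onto the top integer level."""
    qmax = 2 ** (bits - 1) - 1
    return max(abs(v) for v in values) / qmax

def quantize_dequantize(values, scale, bits=8):
    """Symmetric round-to-nearest quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))
        out.append(q * scale)
    return out

def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

# Toy matrix: the second channel (row) contains one large outlier.
channels = [
    [0.11, -0.23, 0.05, 0.31],
    [0.12, -0.18, 60.0, 0.27],
]
flat = [w for row in channels for w in row]

# Per-tensor: one scale for everything, dominated by the outlier.
tensor_scale = absmax_scale(flat, bits=4)
per_tensor = quantize_dequantize(flat, tensor_scale, bits=4)

# Per-channel: each row gets its own scale, so row 0 is unaffected.
per_channel = []
for row in channels:
    per_channel += quantize_dequantize(row, absmax_scale(row, bits=4), bits=4)

print(f"per-tensor  error on channel 0: {mean_abs_error(flat[:4], per_tensor[:4]):.4f}")
print(f"per-channel error on channel 0: {mean_abs_error(flat[:4], per_channel[:4]):.4f}")
```

With a single per-tensor scale, the outlier stretches the INT4 grid so far that every value in channel 0 rounds to zero. Per-channel scales confine the damage to the channel that actually holds the outlier.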

◆ Hands-On Exercises

Exercise 1: Quantize and Benchmark at Multiple Precisions

Goal: Quantize a model from FP16 to INT8 to INT4 and benchmark
Time: 30 minutes
Steps:
  1. Load a base model in FP16
  2. Quantize with bitsandbytes (8-bit, then 4-bit)
  3. Run a standard benchmark at each precision
  4. Measure inference speed at each level
Expected Output: Quality/speed tradeoff chart across precisions


| Type | Resource | Why |
|---|---|---|
| 📄 Paper | Hinton et al. "Distilling the Knowledge in a Neural Network" (2015) | The foundational knowledge distillation paper |
| 📄 Paper | Frantar et al. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers" (2022) | Post-training quantization for large models |
| 📘 Book | "Efficient Deep Learning" by Menghani (2024) | Comprehensive treatment of compression techniques |

★ Sources

  • Hinton et al., "Distilling the Knowledge in a Neural Network" (2015) — the original paper
  • DeepSeek, "DeepSeek-R1 Distilled Models" (2025)
  • Microsoft, "Phi-3 Technical Report" (2024)
  • Gou et al., "Knowledge Distillation: A Survey" (2021)