Advanced Fine-Tuning for LLM Adaptation¶
✨ Bit: Basic SFT teaches a model what good answers look like. DPO teaches it which of two answers to prefer. GRPO teaches it to improve by comparing its own attempts. The progression is: imitation → preference → self-improvement.
★ TL;DR¶
- What: Fine-tuning techniques beyond standard SFT — preference optimization (DPO, ORPO, KTO), RL-style training (GRPO, PPO), continued pretraining, and efficient training stacks (QLoRA, Unsloth)
- Why: SFT alone cannot teach nuanced preferences, reasoning quality, or consistent tool use. You need preference signals and reward-based training.
- Key point: The algorithm matters less than data quality. DPO with great preference pairs beats GRPO with noisy rewards every time.
★ Overview¶
Scope¶
This note focuses on advanced post-training methods: DPO, ORPO, KTO, GRPO, continued pretraining, and practical accelerators (TRL, Unsloth). For SFT and LoRA foundations, see Fine-Tuning LLMs. For the theoretical RLHF pipeline, see RL Alignment.
When You Need Advanced Fine-Tuning¶
| Signal | What It Means | Technique |
|---|---|---|
| SFT model follows format but gives wrong answers | Need preference signal, not just imitation | DPO / ORPO |
| Model reasons poorly on multi-step problems | Need rollout-based reward training | GRPO |
| Tool use is inconsistent (calls wrong tool 30% of time) | Need negative examples showing wrong tool choices | DPO with tool-use pairs |
| Model is too verbose or refuses too aggressively | Behavior drifted during SFT | DPO with conciseness/helpfulness pairs |
| Domain language is very different (legal, medical) | Model's internal prior needs adaptation | Continued pretraining |
| Limited GPU budget (single GPU, 24GB) | Need memory-efficient training | QLoRA + Unsloth |
Decision Tree¶
Do you have labeled preference pairs (chosen/rejected)?
├── YES → Do you also have reward scores?
│ ├── YES → GRPO or PPO (RL-style, strongest but hardest)
│ └── NO → DPO (simplest preference method)
└── NO → Can you generate them?
├── YES → Use LLM-as-judge to create pairs → DPO
└── NO → Do you have positive examples only?
├── YES → KTO (binary "good/bad" signal, no pairs needed)
└── NO → Start with SFT, collect feedback, return here
Prerequisites¶
- Fine-Tuning LLMs — SFT, LoRA, QLoRA foundations
- RL Alignment — RLHF theory, reward modeling
- Synthetic Data & Data Engineering — creating training data
★ Deep Dive¶
The Post-Training Landscape¶
DIFFICULTY / INFRASTRUCTURE →
┌───────────────────────────────────────────────────────┐
│ │
Low │ SFT KTO DPO ORPO │
│ (imitate) (binary) (pairs) (combined) │
│ │
│ IPO SimPO │
│ (robust) (simple) │
│ │
High │ GRPO PPO │
│ (groups) (full RL) │
│ │
└───────────────────────────────────────────────────────┘
ALIGNMENT QUALITY →
DPO: Direct Preference Optimization¶
The key insight: DPO shows that you can extract the optimal policy directly from preference data without ever training a reward model or running PPO. It reformulates the RLHF objective into a simple classification loss.
DPO Loss Function:
L_DPO(π_θ; π_ref) = -E_(x, y_w, y_l) [
log σ(β · (log π_θ(y_w|x)/π_ref(y_w|x) - log π_θ(y_l|x)/π_ref(y_l|x)))
]
Where:
π_θ = policy model being trained
π_ref = reference model (frozen copy of the base model)
y_w = preferred (winning) response
y_l = rejected (losing) response
x = prompt
β = temperature parameter (controls deviation from reference)
σ = sigmoid function
Intuition: The loss increases the probability of preferred responses
relative to rejected ones, while staying close to the
reference model (controlled by β).
Why β matters:
- β too small (< 0.05): Model barely moves from reference — under-training
- β too large (> 0.5): Model over-fits to preference signal, forgets general capabilities
- Sweet spot: β = 0.1 to 0.3 for most tasks
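The loss above can be computed by hand for a single pair. A minimal sketch (the log-prob values are illustrative, not from any real model) showing how β scales the implicit reward margin:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of summed sequence log-probs."""
    # Implicit reward margin: how much more the policy prefers y_w over y_l,
    # relative to the frozen reference model
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # -log sigmoid(beta * margin)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Illustrative log-probs: the policy already prefers the chosen response a bit
# (margin = (-50 + 52) - (-60 + 58) = 4)
loss_small_beta = dpo_loss(-50.0, -60.0, -52.0, -58.0, beta=0.1)
loss_large_beta = dpo_loss(-50.0, -60.0, -52.0, -58.0, beta=0.5)
# A larger beta amplifies the same margin, so the loss is smaller here,
# but it also amplifies gradients and the pull away from the reference
print(loss_small_beta > loss_large_beta)  # True
```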
GRPO: Group Relative Policy Optimization¶
The key insight: Instead of needing a separate reward model (PPO) or explicit preference pairs (DPO), GRPO generates multiple candidate responses for each prompt, scores them with a reward function, and uses the relative ranking within each group to compute the update.
GRPO Process:
1. For each prompt x, generate G candidate responses: {y₁, y₂, ..., y_G}
2. Score each with reward function R: {r₁, r₂, ..., r_G}
3. Normalize scores within the group:
advantage(yᵢ) = (rᵢ - mean(r)) / std(r)
4. Update policy using normalized advantages (like REINFORCE but group-relative)
┌────────────────────────────────────────────────────┐
│ Prompt: "Solve 3x + 7 = 22" │
│ │
│ y₁: "x = 5" ← correct r₁ = 1.0 adv = +1.2 │
│ y₂: "x = 4" ← wrong r₂ = 0.0 adv = -0.8 │
│ y₃: "x = 5" ← correct r₃ = 1.0 adv = +1.2 │
│ y₄: "x = 7" ← wrong r₄ = 0.0 adv = -0.8 │
│ │
│ Policy update: increase P(y₁, y₃), decrease P(y₂, y₄) │
└────────────────────────────────────────────────────┘
Why it's powerful for reasoning:
- Works with verifiable rewards (math correctness, code passes tests)
- No need for human-labeled preference pairs
- Self-improving: model generates its own training signal
- Used by DeepSeek-R1 for reasoning training
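The group normalization in step 3 is just a few lines of arithmetic. A minimal sketch (the advantage values in the diagram above are illustrative; with a population-std normalizer, binary rewards of [1, 0, 1, 0] give exactly ±1.0):

```python
def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts: (r - mean) / std."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n  # population variance
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for one prompt, scored by a binary correctness reward
rewards = [1.0, 0.0, 1.0, 0.0]
advs = group_advantages(rewards)
# Correct answers get positive advantage, wrong ones negative;
# the policy update then upweights the correct rollouts
print([round(a, 2) for a in advs])  # [1.0, -1.0, 1.0, -1.0]
```

Real implementations differ in details (sample vs population std, epsilon handling, KL terms), but the group-relative idea is exactly this normalization.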
Other Preference Methods¶
| Method | What It Does | When to Use | Key Difference from DPO |
|---|---|---|---|
| ORPO | Combines SFT + preference in single loss | When you want to do SFT + alignment in one pass | No reference model needed — simpler setup |
| KTO | Uses binary good/bad signal per response (no pairs) | When you have thumbs up/down data but not paired comparisons | Doesn't require paired chosen/rejected responses |
| IPO | Adds regularization to prevent DPO from overfitting | When DPO training is unstable or overfits | Bounded loss prevents extreme policy shifts |
| SimPO | Length-normalized DPO without reference model | When you want DPO-like results with less memory | No reference model = 50% less GPU memory |
Continued Pretraining vs SFT vs DPO¶
| Choice | What It Changes | Use When | Risk | Cost |
|---|---|---|---|---|
| Continued pretraining | Model's language prior (word distributions, domain patterns) | Domain language very different from base (legal, medical, code) | Catastrophic forgetting, expensive | $$$$ |
| SFT (instruction tuning) | Model's behavior (how it responds to instructions) | Need specific response format, style, or skill | Shallow preference learning | $$ |
| DPO / Preference | Model's preferences (which response is better) | Need nuanced quality judgments, safety alignment | Requires high-quality paired data | $$$ |
| GRPO / RL | Model's reasoning strategy (how it solves problems) | Need improved reasoning, verifiable rewards available | Reward hacking, instability | $$$$ |
The practical training stack (how production teams actually do it):
Step 1: (Optional) Continued Pretraining on Domain Corpus
↓
Step 2: Supervised Fine-Tuning (SFT) on Instruction Data
↓
Step 3: Preference Optimization (DPO/ORPO) on Preference Data
│ ORPO note: trains SFT + alignment in one pass — no reference model
↓
Step 4: (Optional) GRPO/RLVR on Verifiable-Reward Tasks
│ RLVR: uses programmatic rewards (code tests, math checkers) instead
│ of human-ranked preferences — shift from LLM judge to objective rewards
↓
Step 5: Targeted Evaluation (task success, safety, capability regression)
↓
Step 6: Deployment Candidate
The 2026 shift: RLVR (Reinforcement Learning with Verifiable Rewards) is increasingly replacing human preference labels for tasks with objective ground truth. Instead of labeled "chosen vs rejected" pairs, use a reward function that checks correctness:
- Math: Does the expression evaluate to the right answer?
- Code: Do the unit tests pass?
- Structured output: Is the JSON schema valid?
- Tool use: Did the tool call succeed?
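A toy sketch of what such verifiable reward functions look like (the task names and checks here are illustrative, not from any specific framework):

```python
import json

def verifiable_reward(task_type: str, output: str, expected=None) -> float:
    """Toy RLVR-style reward: programmatic checks instead of human labels."""
    if task_type == "math":
        # Reward 1.0 only if the output parses to the expected number
        try:
            return 1.0 if abs(float(output.strip()) - expected) < 1e-9 else 0.0
        except ValueError:
            return 0.0
    if task_type == "json":
        # Reward 1.0 only if the output is syntactically valid JSON
        try:
            json.loads(output)
            return 1.0
        except json.JSONDecodeError:
            return 0.0
    raise ValueError(f"unknown task type: {task_type}")

print(verifiable_reward("math", "5", expected=5.0))  # 1.0 — correct answer
print(verifiable_reward("json", '{"x": 1}'))         # 1.0 — valid JSON
print(verifiable_reward("json", "{x: 1}"))           # 0.0 — invalid JSON
```

Production reward functions add schema validation, unit-test execution in sandboxes, and partial credit, but the principle is the same: the reward is computed, not labeled.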
★ Code & Implementation¶
Complete DPO Training with TRL¶
# pip install trl>=0.12 transformers>=4.45 peft>=0.13 datasets>=3.0 bitsandbytes>=0.44
# ⚠️ Last tested: 2026-04 | Requires: trl>=0.12, transformers>=4.45
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import DPOTrainer, DPOConfig
from datasets import Dataset
# 1. Load base model with 4-bit quantization (QLoRA)
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype="bfloat16",
bnb_4bit_use_double_quant=True,
)
model_name = "meta-llama/Llama-3.1-8B-Instruct" # Or any base model
model = AutoModelForCausalLM.from_pretrained(
model_name, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# 2. Add LoRA adapters
lora_config = LoraConfig(
r=16, # Rank — higher = more capacity, more memory
lora_alpha=32, # Scaling factor (typically 2× rank)
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # Prints the summary itself (returns None)
# Expected: ~0.5% of total parameters are trainable
# 3. Prepare preference dataset
# Each example needs: prompt, chosen (preferred response), rejected (worse response)
preference_data = [
{
"prompt": "Explain quantum computing in simple terms.",
"chosen": "Quantum computing uses qubits that can be 0, 1, or both simultaneously (superposition). This lets quantum computers explore many solutions at once, making them powerful for specific problems like optimization and cryptography. Think of it like reading all pages of a book simultaneously instead of one at a time.",
"rejected": "Quantum computing is a revolutionary paradigm that leverages quantum mechanical phenomena such as superposition and entanglement to perform computations that are intractable for classical computers. The fundamental unit of quantum information is the qubit, which exists in a Hilbert space...",
},
# ... more preference pairs (typically 1000-10000 examples)
]
dataset = Dataset.from_list(preference_data)
# 4. Configure DPO training
dpo_config = DPOConfig(
output_dir="./runs/dpo-llama3",
per_device_train_batch_size=2,
gradient_accumulation_steps=8, # Effective batch = 2 × 8 = 16
learning_rate=5e-6, # Lower than SFT (DPO is sensitive)
beta=0.1, # KL penalty strength (start here)
num_train_epochs=1, # DPO often needs only 1-2 epochs
warmup_ratio=0.1,
logging_steps=10,
save_strategy="epoch",
bf16=True, # Use bfloat16 for training stability
gradient_checkpointing=True, # Reduces memory at cost of speed
max_length=1024,
max_prompt_length=512,
)
# 5. Train
trainer = DPOTrainer(
model=model,
args=dpo_config,
train_dataset=dataset,
processing_class=tokenizer,
)
trainer.train()
# 6. Save the LoRA adapter
trainer.save_model("./dpo-adapter")
# Expected output:
# Training loss decreasing from ~0.7 to ~0.3-0.5 over 1 epoch
# Reward margins (chosen - rejected) increasing from ~0 to ~1-3
# GPU memory usage: ~16GB for 8B model with QLoRA
Preference Dataset Creation with LLM-as-Judge¶
# pip install openai>=1.0
# ⚠️ Last tested: 2026-04 | Requires: openai>=1.0
from openai import OpenAI
import json
client = OpenAI()
def create_preference_pair(prompt: str, response_a: str, response_b: str) -> dict:
"""Use LLM-as-judge to create labeled preference pairs for DPO training."""
judge_prompt = f"""You are a helpful assistant judge. Compare two responses to the same prompt.
PROMPT: {prompt}
RESPONSE A:
{response_a}
RESPONSE B:
{response_b}
Which response is better? Consider: accuracy, helpfulness, conciseness, and safety.
Output JSON: {{"winner": "A" or "B", "reason": "brief explanation"}}"""
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": judge_prompt}],
response_format={"type": "json_object"},
temperature=0,
)
judgment = json.loads(response.choices[0].message.content)
if judgment["winner"] == "A":
return {"prompt": prompt, "chosen": response_a, "rejected": response_b}
else:
return {"prompt": prompt, "chosen": response_b, "rejected": response_a}
# Generate preference pairs at scale
# 1. Collect prompts from your use case
# 2. Generate 2+ responses per prompt (different temperatures, models)
# 3. Use LLM-as-judge to label preferences
# 4. Verify a random sample manually (judge accuracy ~85-90%)
QLoRA with Unsloth (2× Faster, 60% Less Memory)¶
# pip install unsloth
# ⚠️ Last tested: 2026-04 | Requires: unsloth (installs torch, transformers, etc.)
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
# 1. Load model with Unsloth (4-bit, optimized)
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Llama-3.1-8B-Instruct",
max_seq_length=2048,
dtype=None, # Auto-detect
load_in_4bit=True, # QLoRA
)
# 2. Add LoRA adapters (Unsloth-optimized)
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_alpha=16,
lora_dropout=0, # Unsloth recommends 0 for QLoRA
bias="none",
use_gradient_checkpointing="unsloth", # 30% less VRAM
)
# Memory comparison:
# Standard QLoRA (bitsandbytes + PEFT): ~18 GB for Llama-3.1-8B
# Unsloth QLoRA: ~7 GB for Llama-3.1-8B
# Speedup: ~2× faster training
# 3. Train (same TRL interface)
dataset = load_dataset("your-dataset", split="train")
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,  # `tokenizer=` is deprecated in recent TRL versions
    train_dataset=dataset,
args=SFTConfig(
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
num_train_epochs=1,
output_dir="./runs/unsloth-sft",
bf16=True,
),
)
trainer.train()
# Expected output:
# Training at ~2× speed of standard PEFT
# GPU memory: ~7GB (vs ~18GB standard)
# Same quality as standard QLoRA fine-tuning
◆ Formulas & Equations¶
| Name | Formula | Variables | Use |
|---|---|---|---|
| DPO Loss | $\mathcal{L}_{\text{DPO}} = -\log\sigma\left(\beta\left(\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right)$ | $\pi_\theta$=policy, $\pi_{\text{ref}}$=reference, $y_w$=chosen, $y_l$=rejected, $\beta$=KL strength | Preference optimization without a reward model |
| GRPO Advantage | $A(y_i) = \frac{R(y_i) - \mu_G}{\sigma_G}$ | $R$=reward, $\mu_G$=group mean, $\sigma_G$=group std | Normalizing rewards within group |
| LoRA update | $W = W_0 + \alpha \cdot BA$ | $W_0$=frozen, $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$ | Low-rank weight adaptation |
| QLoRA memory | $\text{Memory} \approx \frac{\lvert W \rvert \times 4}{8} + r(d+k) \times 2$ | $\lvert W \rvert$=base parameter count (4-bit $\Rightarrow$ 0.5 bytes each), $r$=LoRA rank, $d,k$=adapted layer dims (fp16 $\Rightarrow$ 2 bytes each) | Estimating weight memory for QLoRA |
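The QLoRA memory formula can be sketched numerically. Note this estimates weight memory only (4-bit base weights at 0.5 bytes each, fp16 LoRA factors at 2 bytes each); real training usage is noticeably higher because of activations, gradients, and optimizer state, which is why the cheat sheet below quotes ~18 GB for an 8B model:

```python
def qlora_weight_memory_gb(n_params, r, layer_dims, n_adapted_matrices):
    """Rough weight-memory estimate for QLoRA.

    n_params: total base-model parameters (stored in 4-bit = 0.5 bytes each)
    r: LoRA rank
    layer_dims: (d, k) shape of each adapted weight matrix
    n_adapted_matrices: how many matrices get LoRA factors
    """
    base = n_params * 4 / 8  # 4 bits per weight = 0.5 bytes
    d, k = layer_dims
    lora = n_adapted_matrices * r * (d + k) * 2  # B (d x r) + A (r x k) in fp16
    return (base + lora) / 1e9

# Llama-3.1-8B-ish: 8B params, rank 16 on 4 attention projections x 32 layers
print(round(qlora_weight_memory_gb(8e9, 16, (4096, 4096), 4 * 32), 2))  # ~4.03
```

The takeaway: the LoRA factors are a rounding error (~34 MB here) next to the quantized base weights, so rank choice is about capacity, not memory.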
◆ Quick Reference¶
TRAINING METHOD CHEAT SHEET:
SFT: "Imitate these good examples"
DPO: "This response is better than that one"
KTO: "This is good / this is bad" (no pairs needed)
ORPO: "SFT + preference in one loss" (no ref model)
GRPO: "Generate many, rank by reward, improve"
PPO: "Full RL with reward model" (most complex)
HYPERPARAMETER STARTING POINTS (DPO):
learning_rate: 5e-6 to 5e-7 (lower than SFT!)
beta: 0.1 (start here, range 0.05-0.5)
epochs: 1-2 (DPO overfits fast)
batch_size: 16-32 effective (with gradient accumulation)
warmup: 10% of steps
GPU MEMORY ESTIMATES (Llama-3.1-8B):
Full fine-tune: ~65 GB (A100 80GB)
LoRA (fp16): ~32 GB (A100 40GB)
QLoRA (4-bit): ~18 GB (RTX 4090)
QLoRA + Unsloth: ~7 GB (RTX 3090)
DATA SIZE GUIDELINES:
SFT: 1K-50K examples
DPO preference pairs: 1K-10K pairs
GRPO: 10K+ prompts (generates own pairs)
Continued pretraining: 100M+ tokens (expensive)
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Reward hacking | Model finds shortcuts to maximize preference signal without actual quality | Reward function has exploitable proxies (e.g., longer = better) | Use multiple evaluation criteria, inspect outputs manually, add length penalty |
| Capability regression | Model gets better at target task but worse at general knowledge | Over-training on narrow preference data, catastrophic forgetting | Run regression test suite, keep training short (1-2 epochs), higher β |
| Verbosity drift | DPO-trained model becomes excessively verbose | Longer responses more often labeled "preferred" in training data | Filter training data for length bias, add length normalization (SimPO) |
| Refusal collapse | Model refuses too aggressively after safety DPO | Safety preference pairs too dominant in training mix | Balance safety pairs (< 10% of dataset), include "helpful but safe" chosen examples |
| Mode collapse on DPO | Model generates near-identical responses regardless of prompt | β too low, model collapsed to single high-reward pattern | Increase β, add KL regularization, diversify training prompts |
| Data poisoning | LLM-as-judge labels are systematically biased | Judge model has its own biases (verbosity, style preferences) | Use multiple judges, validate sample manually, use diverse judge models |
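As one concrete mitigation for verbosity drift, a length-bias filter over the preference pairs might look like this (a character-length heuristic for illustration; token counts are a better measure in practice):

```python
def filter_length_bias(pairs, max_ratio=2.0):
    """Drop preference pairs where the chosen response is much longer than
    the rejected one -- a common source of verbosity drift in DPO data."""
    kept = []
    for p in pairs:
        ratio = len(p["chosen"]) / max(len(p["rejected"]), 1)
        if ratio <= max_ratio:
            kept.append(p)
    return kept

pairs = [
    # chosen is >2x longer than rejected: likely length bias, dropped
    {"prompt": "q1", "chosen": "a long detailed preferred answer", "rejected": "bad"},
    # comparable lengths: kept
    {"prompt": "q2", "chosen": "concise", "rejected": "wrong answer"},
]
print(len(filter_length_bias(pairs)))  # 1
```

A complementary check is to confirm the surviving dataset has roughly balanced chosen/rejected lengths overall, not just per pair.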
○ Gotchas & Common Mistakes¶
- ⚠️ A fancier algorithm cannot rescue bad data: DPO with perfect preference pairs beats GRPO with noisy rewards. Always invest in data quality first.
- ⚠️ DPO learning rate must be lower than SFT: Using SFT learning rates (2e-4) with DPO causes rapid divergence. Start at 5e-6.
- ⚠️ Don't skip the SFT step: DPO on a non-instruction-tuned model produces garbage. Always SFT first, then DPO.
- ⚠️ β is not optional to tune: The default β=0.1 works OK but is rarely optimal. Run 3-5 experiments varying β from 0.05 to 0.5.
- ⚠️ Always compare to prompt engineering first: If you can get 80% of the quality improvement with a better system prompt, fine-tuning is premature optimization.
○ Interview Angles¶
- Q: Why has DPO become more popular than PPO for alignment?
- A: DPO reformulates the RLHF objective so that the optimal policy can be extracted directly from preference pairs, without needing a separate reward model or the unstable PPO training loop. This makes it dramatically simpler to implement — you just need ranked pairs of "chosen" and "rejected" responses, a reference model, and a standard classification-like loss. PPO requires training a reward model, running policy rollouts, computing advantages, and maintaining a value function — all of which introduce instability and hyperparameter sensitivity. In practice, DPO achieves comparable alignment quality to PPO with far less infrastructure complexity. The tradeoff is that DPO is an offline method (it uses a fixed dataset), while PPO can potentially explore and self-improve through online generation. This is where GRPO bridges the gap — it gets PPO-like self-improvement with DPO-like simplicity.
- Q: When would you choose GRPO over DPO?
- A: GRPO shines in two scenarios. First, when you have a verifiable reward function — like math problems (answer is correct or not), code generation (tests pass or fail), or structured output (valid JSON or not). DPO needs someone to label which response is "better," but GRPO can generate its own training signal. Second, when you want the model to explore and find better solutions than what's in your training data. DPO is limited to the quality of your preference pairs — the model can only learn to prefer responses already in the dataset. GRPO generates new candidates and improves on them, enabling genuine self-improvement. DeepSeek used GRPO for training reasoning models (DeepSeek-R1) specifically because math reasoning benefits from this verifiable-reward, generate-and-rank approach.
- Q: How do you prevent capability regression during fine-tuning?
- A: Three practical strategies. First, always maintain a regression test suite that covers core capabilities you care about preserving — general knowledge, instruction following, safety, and language quality. Run this after every training run, not just the final one. Second, keep training short (1-2 epochs for DPO) and use a higher β value (0.2-0.3) to keep the model closer to the reference. Third, use LoRA/QLoRA rather than full fine-tuning — by only modifying a small number of parameters, you inherently limit how much the model can drift from its base capabilities. If you detect regression, you can blend LoRA weights at inference time to find the optimal balance between new capability and preserved performance.
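The LoRA weight blending mentioned in the last answer is just a scaled adapter update. A toy numpy sketch of the underlying math (PEFT provides adapter-merging utilities for this in practice; here λ interpolates between base and fully adapted weights):

```python
import numpy as np

def blended_weight(w0, B, A, alpha, lam):
    """W = W0 + lam * alpha * (B @ A).
    lam=0.0 recovers the base weights; lam=1.0 applies the full adapter."""
    return w0 + lam * alpha * (B @ A)

rng = np.random.default_rng(0)
w0 = rng.normal(size=(8, 8))          # frozen base weight
B = rng.normal(size=(8, 2))           # LoRA factor B (d x r)
A = rng.normal(size=(2, 8))           # LoRA factor A (r x k)

full = blended_weight(w0, B, A, alpha=2.0, lam=1.0)
half = blended_weight(w0, B, A, alpha=2.0, lam=0.5)
# The half-blend sits exactly midway between base and fully adapted weights
print(np.allclose(half, (w0 + full) / 2))  # True
```

Sweeping λ on the regression suite is a cheap way to locate the point where new-task gains stop outweighing capability loss.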
◆ Hands-On Exercises¶
Exercise 1: DPO Training on Preference Data¶
Goal: Train a DPO adapter and compare it to the SFT baseline
Time: 90 minutes (including training time)
Steps:
1. Pick a small model (Llama-3.1-8B or Phi-3-mini) and a task (summarization, coding, or Q&A)
2. Create 100 preference pairs (use LLM-as-judge to label)
3. Train a DPO adapter using the TRL code above
4. Generate responses from both SFT and DPO models on 20 test prompts
5. Evaluate: which model do you (or an LLM judge) prefer?
Expected Output: DPO model preferred on 60-75% of test prompts. Training loss ~0.3-0.5.
Exercise 2: β Sensitivity Analysis¶
Goal: Understand how β affects DPO training behavior
Time: 60 minutes
Steps:
1. Using the same preference dataset, train 3 DPO models: β=0.05, β=0.1, β=0.5
2. For each, measure: training loss, reward margin (chosen - rejected log prob), and generation quality
3. Compare: which β gives the best task quality? Which preserves general capability best?
4. Create a recommendation for your use case
Expected Output: Table comparing the 3 β values across quality, safety, and capability retention metrics
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | Fine-Tuning LLMs (SFT + LoRA foundations), RL Alignment (RLHF theory), Synthetic Data |
| Leads to | Distillation & Compression, reasoning model training, tool-use specialization |
| Compare with | Prompt-only adaptation (cheaper, less capable), pure RAG (no model changes), full RLHF/PPO (more complex, potentially more powerful) |
| Cross-domain | Recommender system optimization, offline reinforcement learning, ranking systems |
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 📄 Paper | Rafailov et al. "DPO: Your Language Model is Secretly a Reward Model" (2023) — Sections 3-4 | The original DPO paper. Section 3 derives the loss; Section 4 shows it matches PPO quality |
| 📄 Paper | Shao et al. "DeepSeekMath: Pushing the Limits of Math Reasoning" (2024) — GRPO section | Introduces GRPO and shows its application to mathematical reasoning |
| 📘 Book | "LLM Engineer's Handbook" by Iusztin & Labonne (2024), Ch 5-6 | Practical guide to fine-tuning pipelines including DPO implementation details |
| 🔧 Hands-on | HuggingFace TRL Documentation — DPO Trainer | Official docs with examples, hyperparameter guidance, and dataset format |
| 🔧 Hands-on | Unsloth Documentation | 2× faster QLoRA fine-tuning with 60% less memory — essential for GPU-constrained training |
| 🎥 Video | Sebastian Raschka — "DPO Explained" | Clear mathematical walk-through of DPO loss derivation and intuition |
| 📄 Paper | Hu et al. "LoRA: Low-Rank Adaptation" (2021) | The foundational paper for parameter-efficient fine-tuning |
★ Sources¶
- Rafailov et al. "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (2023) — https://arxiv.org/abs/2305.18290
- Shao et al. "DeepSeekMath: Pushing the Limits of Mathematical Reasoning" (2024) — https://arxiv.org/abs/2402.03300
- Hu et al. "LoRA: Low-Rank Adaptation of Large Language Models" (2021) — https://arxiv.org/abs/2106.09685
- Dettmers et al. "QLoRA: Efficient Finetuning of Quantized LLMs" (2023) — https://arxiv.org/abs/2305.14314
- HuggingFace TRL Documentation — https://huggingface.co/docs/trl/
- Unsloth Documentation — https://docs.unsloth.ai/