Fine-Tuning LLMs¶
✨ Bit: Full fine-tuning a 70B model needs ~280GB of GPU memory (7× A100 40GBs). QLoRA does it on 1 GPU. That's not an optimization — that's a paradigm shift.
★ TL;DR¶
- What: Adapting a pre-trained LLM's weights on your specific data to change its behavior, style, or domain expertise
- Why: When prompting isn't enough — you need the model to consistently behave a certain way
- Key point: LoRA/QLoRA made fine-tuning accessible. You don't need a GPU cluster anymore.
★ Overview¶
Definition¶
Fine-tuning is the process of continuing to train a pre-trained LLM on a smaller, task-specific dataset to adapt it for specific use cases. Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA achieve this by training only a tiny fraction of parameters.
Scope¶
Covers: Full fine-tuning, LoRA, QLoRA, PEFT methods, when to use vs RAG. For RAG as the alternative approach, see Rag. For DPO, GRPO, and other advanced post-training strategies, see Advanced Fine-Tuning for LLM Adaptation.
Significance¶
- Bridges gap between general-purpose LLMs and domain-specific needs
- LoRA (2021) democratized fine-tuning: any developer with 1 GPU can now customize an LLM
- 2025-2026 consensus: Hybrid RAG + LoRA is the production gold standard
Prerequisites¶
- Llms Overview — what you're fine-tuning
- Basic PyTorch / training loop understanding
- GPU access (even a single consumer GPU works with QLoRA)
★ Deep Dive¶
Types of Fine-Tuning¶
Fine-Tuning Methods
├── Full Fine-Tuning
│   └── Update ALL parameters (expensive, risk of catastrophic forgetting)
│
├── Parameter-Efficient Fine-Tuning (PEFT)
│   ├── LoRA (Low-Rank Adaptation) ← most popular
│   ├── QLoRA (Quantized LoRA) ← most accessible
│   ├── DoRA (Weight-Decomposed LoRA) ← newer refinement of LoRA (2024)
│   ├── Adapters (insert small trainable modules)
│   └── Prefix Tuning / Prompt Tuning
│
└── Alignment Fine-Tuning
    ├── SFT (Supervised Fine-Tuning)
    ├── RLHF (Reinforcement Learning from Human Feedback)
    ├── DPO (Direct Preference Optimization) ← simpler RLHF alternative
    └── GRPO (Group Relative Policy Optimization) ← newer, used for reasoning models
LoRA: How It Works¶
Core idea: Instead of updating the full weight matrix W (millions to billions of params), learn the update as the product of two small matrices.
Original: y = W·x (W is d×d, e.g. 4096×4096 ≈ 16.8M params)
LoRA: y = W·x + B·A·x (A is r×d, B is d×r, rank r ≈ 8-64)
Freeze W; train only A and B.
Example with rank r=16, d=4096:
W: 4096 × 4096 = 16,777,216 params (FROZEN)
A: 16 × 4096 = 65,536 params (trainable)
B: 4096 × 16 = 65,536 params (trainable)
Total trainable: 131,072 (0.78% of the original!)
┌──────────────────────────────────────────────────────┐
│                         LoRA                         │
│                                                      │
│          ┌──► [ W (frozen) ] ────────────┐           │
│ Input x ─┤                             ADD ──► y     │
│          └──► [A (train)] ──► [B (train)]┘           │
│                                                      │
│        Original path + low-rank update path          │
└──────────────────────────────────────────────────────┘
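To make the shapes concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer. The `LoRALinear` class is illustrative, not the `peft` implementation; the init follows the paper's convention of Gaussian A and zero B:

```python
# Minimal LoRA linear layer — illustrative sketch, not the peft implementation.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d: int, r: int = 16, alpha: int = 32):
        super().__init__()
        self.W = nn.Linear(d, d, bias=False)             # pre-trained weight
        self.W.weight.requires_grad = False              # frozen
        self.A = nn.Parameter(torch.randn(r, d) * 0.01)  # down-projection (r×d), Gaussian init
        self.B = nn.Parameter(torch.zeros(d, r))         # up-projection (d×r), zero init
        self.scale = alpha / r                           # lora_alpha / r scaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W·x + (alpha/r)·B·A·x
        return self.W(x) + self.scale * ((x @ self.A.T) @ self.B.T)

layer = LoRALinear(d=4096, r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} / {total:,} = {100 * trainable / total:.2f}%")
# trainable: 131,072 / 16,908,288 = 0.78%
```

Because B starts at zero, the low-rank path contributes nothing at initialization, so fine-tuning starts from exactly the pre-trained behavior.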
QLoRA: LoRA + Quantization¶
QLoRA = Quantize base model to 4-bit + Apply LoRA adapters (16-bit)
Memory comparison for LLaMA 70B:
Full fine-tuning: ~280 GB (e.g., 7× A100 40GB or 4× A100 80GB)
LoRA (16-bit): ~160 GB (e.g., 4× A100 40GB)
QLoRA (4-bit): ~35 GB (1× A100 40GB, or a single 48 GB card like the RTX A6000)
Performance: typically within 1-2% of full fine-tuning
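A back-of-envelope sketch of where those numbers come from. This counts only the model weights (plus gradients for full fine-tuning); optimizer states and activations add overhead on top:

```python
# Back-of-envelope GPU memory for a 70B-parameter model's weights.
# Real usage adds optimizer states and activations on top of these.
params = 70e9

print(f"16-bit weights:         {params * 2 / 1e9:.0f} GB")    # ~140 GB (LoRA base)
print(f"4-bit (NF4) weights:    {params * 0.5 / 1e9:.0f} GB")  # ~35 GB (QLoRA base)
print(f"16-bit weights + grads: {params * 4 / 1e9:.0f} GB")    # ~280 GB (full FT, before optimizer states)
```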
Training Data Format¶
// Instruction format (most common)
{
  "instruction": "Summarize the following medical report",
  "input": "[medical report text]",
  "output": "[summary]"
}

// Chat format (for conversational fine-tuning)
{
  "messages": [
    {"role": "system", "content": "You are a medical assistant"},
    {"role": "user", "content": "What does this lab result mean?"},
    {"role": "assistant", "content": "[expected response]"}
  ]
}

// How much data?
// Minimum: ~100 high-quality examples (for style/format changes)
// Good: 1,000-10,000 examples
// Diminishing returns beyond ~50,000 for most tasks
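Either format is typically stored as JSONL, one JSON object per line. A minimal sketch of writing instruction-format records (the example records are placeholders):

```python
# Write instruction-format training examples to JSONL: one JSON object per line.
import json

examples = [
    {
        "instruction": "Summarize the following medical report",
        "input": "Patient presents with ...",   # placeholder
        "output": "One-paragraph summary ...",  # placeholder
    },
    # ... more examples
]

with open("your_training_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

This produces the `your_training_data.jsonl` file the training script below expects.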
Key Hyperparameters¶
| Parameter | Typical Value | What It Does |
|---|---|---|
| `r` (LoRA rank) | 8-64 | Higher = more capacity, more memory |
| `lora_alpha` | 16-32 | Scaling factor (often set to 2×r) |
| `lora_dropout` | 0.05-0.1 | Regularization |
| `target_modules` | q_proj, v_proj, k_proj, o_proj | Which layers to apply LoRA to |
| `learning_rate` | 1e-4 to 2e-4 | Lower than pre-training |
| `epochs` | 1-5 | Often just 1-3 is enough |
| `batch_size` | 4-16 | Limited by GPU memory |
★ Code & Implementation¶
Fine-tuning with QLoRA (Step-by-Step)¶
# pip install "transformers>=4.40" "peft>=0.10" "bitsandbytes>=0.42" "trl>=0.9" datasets accelerate
# ⚠️ Last tested: 2026-04 | Requires: GPU with CUDA (RTX 3080+ or A100)

# 1. Imports
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# 2. Load model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

# 3. Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)

# 4. Load your dataset
dataset = load_dataset("json", data_files="your_training_data.jsonl")

# 5. Train
training_config = SFTConfig(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="epoch",
)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    tokenizer=tokenizer,  # newer trl versions rename this to processing_class
    args=training_config,
)
trainer.train()
model.save_pretrained("./my-fine-tuned-model")
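After training, the adapter can be loaded back onto the base model for inference. A sketch using `peft` (note: merging requires loading the base in 16-bit, not 4-bit):

```python
# Load base model + trained LoRA adapter, then optionally merge for serving.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "./my-fine-tuned-model")

# Fold the adapter into the base weights so serving needs no peft dependency.
model = model.merge_and_unload()
model.save_pretrained("./my-merged-model")
```

Since the unmerged adapter is only tens of MB, shipping one adapter per task or customer and loading it at serve time is a common deployment pattern.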
◆ Comparison¶
| Aspect | Full Fine-Tuning | LoRA | QLoRA | RAG (alternative) |
|---|---|---|---|---|
| Params updated | All | 0.1-1% | 0.1-1% | None |
| GPU memory | Very high | Medium | Low (1 GPU) | N/A |
| Training data | >10K examples | 100-10K | 100-10K | Documents (no training) |
| Changes | Everything | Behavior/style | Behavior/style | Adds knowledge |
| Knowledge | Baked in (static) | Baked in (static) | Baked in (static) | Dynamic (updatable) |
| Cost | $$$$$ | $$ | $ | $ (infra only) |
◆ Strengths vs Limitations¶
| ✅ Strengths | ❌ Limitations |
|---|---|
| Permanently changes model behavior | Requires training data curation |
| Consistent output style/format | Risk of catastrophic forgetting |
| Lower inference cost (no retrieval) | Knowledge is static (vs RAG's dynamic) |
| Can improve reasoning for specific domains | Overfitting on small datasets |
| LoRA adapters are tiny (~10-100 MB) | Still needs GPU for training |
○ Gotchas & Common Mistakes¶
- ⚠️ "Just fine-tune it" is usually wrong: Try prompting and RAG first. Fine-tuning is for behavior, not knowledge.
- ⚠️ Data quality > data quantity: 100 perfect examples beat 10,000 noisy ones
- ⚠️ Catastrophic forgetting: The model may forget general capabilities. Use diverse training data and low learning rates.
- ⚠️ Evaluation is hard: Always hold out a test set (see the split sketch after this list). Manual evaluation (vibes check) matters more than loss curves.
- ⚠️ LoRA rank too high: r=256 isn't automatically better. Start with r=16, increase only if underfitting.
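A minimal sketch of carving out the held-out split with the `datasets` library before training (the 90/10 ratio is an assumption, not a rule):

```python
# Hold out a test split BEFORE training so evaluation isn't contaminated.
from datasets import load_dataset

dataset = load_dataset("json", data_files="your_training_data.jsonl")["train"]
split = dataset.train_test_split(test_size=0.1, seed=42)  # 90/10 split
train_ds, test_ds = split["train"], split["test"]
# Train on train_ds; report metrics (and do the vibes check) on test_ds only.
```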
○ Interview Angles¶
- Q: When would you fine-tune vs use RAG?
- A: Fine-tune for: output format changes, domain-specific reasoning/style, consistent behavior. RAG for: up-to-date knowledge, source attribution, private data access. Best practice in 2026: combine both — LoRA for behavior, RAG for facts.
- Q: Explain how LoRA reduces memory requirements.
- A: Instead of updating the full d×d weight matrix, LoRA learns a low-rank update B·A using two small matrices of rank r (d×r and r×d). With r=16 on a 4096-dim layer, you train ~0.78% of the parameters. QLoRA goes further by quantizing the frozen base model to 4-bit, reducing memory from ~280GB to ~35GB for a 70B model.
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | Llms Overview, Transformers |
| Leads to | Domain-specific models, Hybrid RAG systems |
| Compare with | Rag (knowledge injection), Prompt engineering (no training) |
| Cross-domain | Transfer learning (CV), Adapter methods |
★ Fine-Tuning Tooling (2026)¶
| Tool | Key Feature | When to Use |
|---|---|---|
| Unsloth | 2x faster, 70% less memory, free tier | Default choice for LoRA/QLoRA fine-tuning in 2026. Massive open-source adoption. |
| Axolotl | YAML-based config, multi-GPU, many methods | Complex multi-dataset training, advanced configs |
| HuggingFace TRL | Official SFTTrainer, RLHF support | When you need RLHF/DPO integration or HF ecosystem |
| LLaMA Factory | 100+ models, WebUI, no code needed | Quick experiments, non-engineers |
| torchtune | PyTorch-native, Meta official | LLaMA models, when you want low-level control |
# ═══ Unsloth Example (2x faster QLoRA) ═══
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # pre-quantized 4-bit checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
    lora_dropout=0,  # 0 enables Unsloth's optimized path
)
# Train with the standard HuggingFace Trainer — just faster!
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Catastrophic forgetting | Model loses general capabilities after fine-tuning | Aggressive learning rate, too many epochs on narrow data | Lower LR (5e-6 to 1e-5), early stopping, eval on general benchmarks |
| Overfitting to format | Model only works with exact training format | Insufficient format diversity in training data | Augment with paraphrases, vary system prompts |
| Data contamination | Inflated eval metrics don't reflect real performance | Test data leaked into training set | Strict train/test split, deduplication, temporal splits |
| Reward hacking | High reward scores but poor actual output quality | Misspecified reward function | Human evaluation alongside automated metrics, KL penalty |
| Training instability | Loss spikes, NaN gradients, divergence | Learning rate too high, batch size too small | Gradient clipping, warmup schedule, data quality audit |
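Several of these mitigations map directly onto trainer settings. A hedged sketch of stability-oriented `SFTConfig` values for the QLoRA run above (starting points, not a universal recipe):

```python
# Stability-oriented training settings — starting points, tune per task.
from trl import SFTConfig

stable_config = SFTConfig(
    output_dir="./output",
    learning_rate=1e-4,             # back off further if loss spikes
    max_grad_norm=1.0,              # gradient clipping against spikes / NaNs
    warmup_ratio=0.03,              # LR warmup to avoid early divergence
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch of 16 on one GPU
    num_train_epochs=1,             # fewer passes reduces forgetting/overfitting
)
```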
◆ Hands-On Exercises¶
Exercise 1: Fine-Tune and Measure Forgetting¶
Goal: Fine-tune a model and quantify capability regression
Time: 60 minutes
Steps:
1. Take a base model (e.g., distilbert-base)
2. Fine-tune on IMDB sentiment (3 epochs)
3. Evaluate on the IMDB test set AND on a general NLI benchmark
4. Compare general capability before and after fine-tuning
Expected Output: Accuracy on the target task vs regression on general benchmarks
Exercise 2: Debug a Bad Fine-Tune¶
Goal: Diagnose and fix a fine-tuning job that overfits
Time: 30 minutes
Steps:
1. Run a fine-tune with intentionally bad hyperparams (LR=1e-3, 10 epochs)
2. Plot training loss vs validation loss (one plotting approach is sketched below)
3. Identify the epoch where overfitting begins
4. Re-run with a corrected LR, early stopping, and dropout
Expected Output: Training curves showing the overfitting pattern and the fix
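For Exercise 2, one way to get the curves is from the trainer's log history. A minimal sketch (assumes a `trainer` that has already run with an eval dataset configured):

```python
# Plot train vs. validation loss from a finished HF Trainer run.
import matplotlib.pyplot as plt

logs = trainer.state.log_history  # list of dicts logged during training
train_pts = [(e["step"], e["loss"]) for e in logs if "loss" in e]
eval_pts = [(e["step"], e["eval_loss"]) for e in logs if "eval_loss" in e]

plt.plot(*zip(*train_pts), label="train loss")
plt.plot(*zip(*eval_pts), label="validation loss")
plt.xlabel("step")
plt.ylabel("loss")
plt.legend()
plt.savefig("loss_curves.png")
# Overfitting shows as validation loss rising while train loss keeps falling.
```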
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 📄 Paper | Hu et al. "LoRA" (2021) | The foundational parameter-efficient fine-tuning paper |
| 🔧 Hands-on | HuggingFace PEFT Library | Production PEFT implementation with LoRA, QLoRA, etc. |
| 📘 Book | "LLM Engineer's Handbook" by Iusztin & Labonne (2024), Ch 5-6 | Practical fine-tuning pipeline guide |
| 🎥 Video | Sebastian Raschka — "LoRA and Fine-Tuning LLMs" | Clear explanation of LoRA mechanics and practical tips |
★ Sources¶
- Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models" (2021)
- Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs" (2023)
- Liu et al., "DoRA: Weight-Decomposed Low-Rank Adaptation" (2024)
- Hugging Face PEFT documentation — https://huggingface.co/docs/peft
- Unsloth — https://github.com/unslothai/unsloth
- Axolotl — https://github.com/OpenAccess-AI-Collective/axolotl
- Sebastian Raschka, "Fine-Tuning LLMs" guide