Fine-Tuning LLMs¶
✨ Bit: Full fine-tuning a 70B model needs ~280GB of GPU memory (7× A100 40GBs). QLoRA does it on 1 GPU. That's not an optimization — that's a paradigm shift.
★ TL;DR¶
- What: Adapting a pre-trained LLM's weights on your specific data to change its behavior, style, or domain expertise
- Why: When prompting isn't enough — you need the model to consistently behave a certain way
- Key point: LoRA/QLoRA made fine-tuning accessible. You don't need a GPU cluster anymore.
★ Overview¶
Definition¶
Fine-tuning is the process of continuing to train a pre-trained LLM on a smaller, task-specific dataset to adapt it for specific use cases. Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA achieve this by training only a tiny fraction of parameters.
Scope¶
Covers: Full fine-tuning, LoRA, QLoRA, PEFT methods, when to use vs RAG. For RAG as the alternative approach, see Rag. For DPO, GRPO, and other advanced post-training strategies, see Advanced Fine-Tuning for LLM Adaptation.
Significance¶
- Bridges gap between general-purpose LLMs and domain-specific needs
- LoRA (2021) democratized fine-tuning: any developer with 1 GPU can now customize an LLM
- 2025-2026 consensus: Hybrid RAG + LoRA is the production gold standard
Prerequisites¶
- Llms Overview — what you're fine-tuning
- Basic PyTorch / training loop understanding
- GPU access (even a single consumer GPU works with QLoRA)
★ Deep Dive¶
Types of Fine-Tuning¶
Fine-Tuning Methods
├── Full Fine-Tuning
│   └── Update ALL parameters (expensive, risk of catastrophic forgetting)
│
├── Parameter-Efficient Fine-Tuning (PEFT)
│   ├── LoRA (Low-Rank Adaptation) ← most popular
│   ├── QLoRA (Quantized LoRA) ← most accessible
│   ├── DoRA (Weight-Decomposed LoRA) ← newer refinement of LoRA (2024)
│   ├── Adapters (insert small trainable modules)
│   └── Prefix Tuning / Prompt Tuning
│
└── Alignment Fine-Tuning
    ├── SFT (Supervised Fine-Tuning)
    ├── RLHF (Reinforcement Learning from Human Feedback)
    ├── DPO (Direct Preference Optimization) ← simpler RLHF alternative
    └── GRPO (Group Relative Policy Optimization) ← newer, used for reasoning models
LoRA: How It Works¶
Core idea: Instead of updating the full weight matrix W (millions to billions of params), learn the update as the product of two small matrices.
Original: y = W·x (W is d×d, e.g. 4096×4096 ≈ 16.8M params)
LoRA: y = W·x + B·A·x (A is r×d, B is d×r, rank r ≈ 8-64)
Freeze W; train only A and B.
Example with rank r=16, d=4096:
W: 4096 × 4096 = 16,777,216 params (FROZEN)
A: 16 × 4096 = 65,536 params (trainable)
B: 4096 × 16 = 65,536 params (trainable)
Total trainable: 131,072 (0.78% of the original!)
┌──────────────────────────────────────────────────────┐
│                         LoRA                         │
│                                                      │
│          ┌──► [ W (frozen) ] ────────────┐           │
│ Input x ─┤                             ADD ──► y     │
│          └──► [A (train)] ──► [B (train)]┘           │
│                                                      │
│        Original path + low-rank update path          │
└──────────────────────────────────────────────────────┘
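To make the shapes concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer. The `LoRALinear` class is illustrative, not the `peft` implementation; the init follows the paper's convention of Gaussian A and zero B:

```python
# Minimal LoRA linear layer — illustrative sketch, not the peft implementation.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d: int, r: int = 16, alpha: int = 32):
        super().__init__()
        self.W = nn.Linear(d, d, bias=False)             # pre-trained weight
        self.W.weight.requires_grad = False              # frozen
        self.A = nn.Parameter(torch.randn(r, d) * 0.01)  # down-projection (r×d), Gaussian init
        self.B = nn.Parameter(torch.zeros(d, r))         # up-projection (d×r), zero init
        self.scale = alpha / r                           # lora_alpha / r scaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W·x + (alpha/r)·B·A·x
        return self.W(x) + self.scale * ((x @ self.A.T) @ self.B.T)

layer = LoRALinear(d=4096, r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} / {total:,} = {100 * trainable / total:.2f}%")
# trainable: 131,072 / 16,908,288 = 0.78%
```

Because B starts at zero, the low-rank path contributes nothing at initialization, so fine-tuning starts from exactly the pre-trained behavior.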
QLoRA: LoRA + Quantization¶
QLoRA = Quantize base model to 4-bit + Apply LoRA adapters (16-bit)
Memory comparison for LLaMA 70B:
Full fine-tuning: ~280 GB (e.g., 7× A100 40GB or 4× A100 80GB)
LoRA (16-bit): ~160 GB (e.g., 4× A100 40GB)
QLoRA (4-bit): ~35 GB (1× A100 40GB, or a single 48 GB card like the RTX A6000)
Performance: typically within 1-2% of full fine-tuning
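A back-of-envelope sketch of where those numbers come from. This counts only the model weights (plus gradients for full fine-tuning); optimizer states and activations add overhead on top:

```python
# Back-of-envelope GPU memory for a 70B-parameter model's weights.
# Real usage adds optimizer states and activations on top of these.
params = 70e9

print(f"16-bit weights:         {params * 2 / 1e9:.0f} GB")    # ~140 GB (LoRA base)
print(f"4-bit (NF4) weights:    {params * 0.5 / 1e9:.0f} GB")  # ~35 GB (QLoRA base)
print(f"16-bit weights + grads: {params * 4 / 1e9:.0f} GB")    # ~280 GB (full FT, before optimizer states)
```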
Training Data Format¶
// Instruction format (most common)
{
  "instruction": "Summarize the following medical report",
  "input": "[medical report text]",
  "output": "[summary]"
}

// Chat format (for conversational fine-tuning)
{
  "messages": [
    {"role": "system", "content": "You are a medical assistant"},
    {"role": "user", "content": "What does this lab result mean?"},
    {"role": "assistant", "content": "[expected response]"}
  ]
}

// How much data?
// Minimum: ~100 high-quality examples (for style/format changes)
// Good: 1,000-10,000 examples
// Diminishing returns beyond ~50,000 for most tasks
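Either format is typically stored as JSONL, one JSON object per line. A minimal sketch of writing instruction-format records (the example records are placeholders):

```python
# Write instruction-format training examples to JSONL: one JSON object per line.
import json

examples = [
    {
        "instruction": "Summarize the following medical report",
        "input": "Patient presents with ...",   # placeholder
        "output": "One-paragraph summary ...",  # placeholder
    },
    # ... more examples
]

with open("your_training_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

This produces the `your_training_data.jsonl` file the training script below expects.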
Key Hyperparameters¶
| Parameter | Typical Value | What It Does |
|---|---|---|
| `r` (LoRA rank) | 8-64 | Higher = more capacity, more memory |
| `lora_alpha` | 16-32 | Scaling factor (often set to 2×r) |
| `lora_dropout` | 0.05-0.1 | Regularization |
| `target_modules` | q_proj, v_proj, k_proj, o_proj | Which layers to apply LoRA to |
| `learning_rate` | 1e-4 to 2e-4 | Lower than pre-training |
| `epochs` | 1-5 | Often just 1-3 is enough |
| `batch_size` | 4-16 | Limited by GPU memory |
★ Code & Implementation¶
Fine-tuning with QLoRA (Step-by-Step)¶
# pip install "transformers>=4.40" "peft>=0.10" "bitsandbytes>=0.42" "trl>=0.9" datasets accelerate
# ⚠️ Last tested: 2026-04 | Requires: GPU with CUDA (RTX 3080+ or A100)

# 1. Imports
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# 2. Load model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

# 3. Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)

# 4. Load your dataset
dataset = load_dataset("json", data_files="your_training_data.jsonl")

# 5. Train
training_config = SFTConfig(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="epoch",
)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    tokenizer=tokenizer,  # newer trl versions rename this to processing_class
    args=training_config,
)
trainer.train()
model.save_pretrained("./my-fine-tuned-model")
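After training, the adapter can be loaded back onto the base model for inference. A sketch using `peft` (note: merging requires loading the base in 16-bit, not 4-bit):

```python
# Load base model + trained LoRA adapter, then optionally merge for serving.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "./my-fine-tuned-model")

# Fold the adapter into the base weights so serving needs no peft dependency.
model = model.merge_and_unload()
model.save_pretrained("./my-merged-model")
```

Since the unmerged adapter is only tens of MB, shipping one adapter per task or customer and loading it at serve time is a common deployment pattern.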
◆ Comparison¶
| Aspect | Full Fine-Tuning | LoRA | QLoRA | RAG (alternative) |
|---|---|---|---|---|
| Params updated | All | 0.1-1% | 0.1-1% | None |
| GPU memory | Very high | Medium | Low (1 GPU) | N/A |
| Training data | >10K examples | 100-10K | 100-10K | Documents (no training) |
| Changes | Everything | Behavior/style | Behavior/style | Adds knowledge |
| Knowledge | Baked in (static) | Baked in (static) | Baked in (static) | Dynamic (updatable) |
| Cost | $$$$$ | $$ | $ | $ (infra only) |
◆ Strengths vs Limitations¶
| ✅ Strengths | ❌ Limitations |
|---|---|
| Permanently changes model behavior | Requires training data curation |
| Consistent output style/format | Risk of catastrophic forgetting |
| Lower inference cost (no retrieval) | Knowledge is static (vs RAG's dynamic) |
| Can improve reasoning for specific domains | Overfitting on small datasets |
| LoRA adapters are tiny (~10-100 MB) | Still needs GPU for training |
○ Gotchas & Common Mistakes¶
- ⚠️ "Just fine-tune it" is usually wrong: Try prompting and RAG first. Fine-tuning is for behavior, not knowledge.
- ⚠️ Data quality > data quantity: 100 perfect examples beat 10,000 noisy ones
- ⚠️ Catastrophic forgetting: The model may forget general capabilities. Use diverse training data and low learning rates.
- ⚠️ Evaluation is hard: Always hold out a test set (see the split sketch after this list). Manual evaluation (vibes check) matters more than loss curves.
- ⚠️ LoRA rank too high: r=256 isn't automatically better. Start with r=16, increase only if underfitting.
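A minimal sketch of carving out the held-out split with the `datasets` library before training (the 90/10 ratio is an assumption, not a rule):

```python
# Hold out a test split BEFORE training so evaluation isn't contaminated.
from datasets import load_dataset

dataset = load_dataset("json", data_files="your_training_data.jsonl")["train"]
split = dataset.train_test_split(test_size=0.1, seed=42)  # 90/10 split
train_ds, test_ds = split["train"], split["test"]
# Train on train_ds; report metrics (and do the vibes check) on test_ds only.
```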
○ Interview Angles¶
- Q: When would you fine-tune vs use RAG?
- A: Fine-tune for: output format changes, domain-specific reasoning/style, consistent behavior. RAG for: up-to-date knowledge, source attribution, private data access. Best practice in 2026: combine both — LoRA for behavior, RAG for facts.
- Q: Explain how LoRA reduces memory requirements.
- A: Instead of updating the full d×d weight matrix, LoRA learns a low-rank update B·A using two small matrices of rank r (d×r and r×d). With r=16 on a 4096-dim layer, you train ~0.78% of the parameters. QLoRA goes further by quantizing the frozen base model to 4-bit, reducing memory from ~280GB to ~35GB for a 70B model.
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | Llms Overview, Transformers |
| Leads to | Domain-specific models, Hybrid RAG systems |
| Compare with | Rag (knowledge injection), Prompt engineering (no training) |
| Cross-domain | Transfer learning (CV), Adapter methods |
★ Fine-Tuning Tooling (2026)¶
| Tool | Key Feature | When to Use |
|---|---|---|
| Unsloth | 2x faster, 70% less memory, free tier | Default choice for LoRA/QLoRA fine-tuning in 2026. Massive open-source adoption. |
| Axolotl | YAML-based config, multi-GPU, many methods | Complex multi-dataset training, advanced configs |
| HuggingFace TRL | Official SFTTrainer, RLHF support | When you need RLHF/DPO integration or HF ecosystem |
| LLaMA Factory | 100+ models, WebUI, no code needed | Quick experiments, non-engineers |
| torchtune | PyTorch-native, Meta official | LLaMA models, when you want low-level control |
# ═══ Unsloth Example (2x faster QLoRA) ═══
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # pre-quantized 4-bit checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
    lora_dropout=0,  # 0 enables Unsloth's optimized path
)
# Train with the standard HuggingFace Trainer — just faster!
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Catastrophic forgetting | Model loses general capabilities after fine-tuning | Aggressive learning rate, too many epochs on narrow data | Lower LR (5e-6 to 1e-5), early stopping, eval on general benchmarks |
| Overfitting to format | Model only works with exact training format | Insufficient format diversity in training data | Augment with paraphrases, vary system prompts |
| Data contamination | Inflated eval metrics don't reflect real performance | Test data leaked into training set | Strict train/test split, deduplication, temporal splits |
| Reward hacking | High reward scores but poor actual output quality | Misspecified reward function | Human evaluation alongside automated metrics, KL penalty |
| Training instability | Loss spikes, NaN gradients, divergence | Learning rate too high, batch size too small | Gradient clipping, warmup schedule, data quality audit |
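Several of these mitigations map directly onto trainer settings. A hedged sketch of stability-oriented `SFTConfig` values for the QLoRA run above (starting points, not a universal recipe):

```python
# Stability-oriented training settings — starting points, tune per task.
from trl import SFTConfig

stable_config = SFTConfig(
    output_dir="./output",
    learning_rate=1e-4,             # back off further if loss spikes
    max_grad_norm=1.0,              # gradient clipping against spikes / NaNs
    warmup_ratio=0.03,              # LR warmup to avoid early divergence
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch of 16 on one GPU
    num_train_epochs=1,             # fewer passes reduces forgetting/overfitting
)
```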
◆ Hands-On Exercises¶
Exercise 1: Fine-Tune and Measure Forgetting¶
Goal: Fine-tune a model and quantify capability regression
Time: 60 minutes
Steps:
1. Take a base model (e.g., distilbert-base)
2. Fine-tune on IMDB sentiment (3 epochs)
3. Evaluate on the IMDB test set AND on a general NLI benchmark
4. Compare general capability before and after fine-tuning
Expected Output: Accuracy on the target task vs regression on general benchmarks
Exercise 2: Debug a Bad Fine-Tune¶
Goal: Diagnose and fix a fine-tuning job that overfits
Time: 30 minutes
Steps:
1. Run a fine-tune with intentionally bad hyperparams (LR=1e-3, 10 epochs)
2. Plot training loss vs validation loss (one plotting approach is sketched below)
3. Identify the epoch where overfitting begins
4. Re-run with a corrected LR, early stopping, and dropout
Expected Output: Training curves showing the overfitting pattern and the fix
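For Exercise 2, one way to get the curves is from the trainer's log history. A minimal sketch (assumes a `trainer` that has already run with an eval dataset configured):

```python
# Plot train vs. validation loss from a finished HF Trainer run.
import matplotlib.pyplot as plt

logs = trainer.state.log_history  # list of dicts logged during training
train_pts = [(e["step"], e["loss"]) for e in logs if "loss" in e]
eval_pts = [(e["step"], e["eval_loss"]) for e in logs if "eval_loss" in e]

plt.plot(*zip(*train_pts), label="train loss")
plt.plot(*zip(*eval_pts), label="validation loss")
plt.xlabel("step")
plt.ylabel("loss")
plt.legend()
plt.savefig("loss_curves.png")
# Overfitting shows as validation loss rising while train loss keeps falling.
```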
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 📄 Paper | Hu et al. "LoRA" (2021) | The foundational parameter-efficient fine-tuning paper |
| 🔧 Hands-on | HuggingFace PEFT Library | Production PEFT implementation with LoRA, QLoRA, etc. |
| 📘 Book | "LLM Engineer's Handbook" by Iusztin & Labonne (2024), Ch 5-6 | Practical fine-tuning pipeline guide |
| 🎥 Video | Sebastian Raschka — "LoRA and Fine-Tuning LLMs" | Clear explanation of LoRA mechanics and practical tips |
★ Sources¶
- Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models" (2021)
- Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs" (2023)
- Liu et al., "DoRA: Weight-Decomposed Low-Rank Adaptation" (2024)
- Hugging Face PEFT documentation — https://huggingface.co/docs/peft
- Unsloth — https://github.com/unslothai/unsloth
- Axolotl — https://github.com/OpenAccess-AI-Collective/axolotl
- Sebastian Raschka, "Fine-Tuning LLMs" guide