Model Merging¶

✨ Bit: Fine-tuning gives you specialists. Model merging gives you a generalist built from specialists — at zero extra training cost.

★ TL;DR¶

What: Techniques to combine the weights of multiple fine-tuned models into a single model without additional training
Why: Cheaper than training a multi-task model from scratch; lets you combine domain experts (code + medical + safety) into one deployable model
Key point: Merging is NOT ensembling — you get ONE model at inference time, with no extra latency or memory cost

★ Overview¶

Definition¶

Model merging encompasses weight-space techniques that combine parameters from two or more models — typically fine-tuned from the same base model — into a single unified model. Unlike ensembles (which run multiple models at inference), merged models have identical cost and latency to a single model.

Scope¶

Covers: TIES, DARE, SLERP, Model Soups, Frankenmerging, and the mergekit toolchain. For model size reduction, see Distillation and Compression. For weight modification via training, see Fine-Tuning and Advanced Fine-Tuning.

Significance¶

Common open-weight technique in 2025-2026: merged models have frequently ranked near the top of the HuggingFace Open LLM Leaderboard
Cost: Zero GPU-hours for training — merging is a pure weight arithmetic operation
Production use cases: Multi-task deployment, domain adaptation stacking, safety alignment injection
Interview topic: "When would you merge models instead of fine-tuning or distilling?" is a common system design question

Prerequisites¶

Fine-Tuning — understanding what fine-tuned weight deltas represent
Distillation and Compression — alternative model optimization path
Inference Optimization — serving context for merged models

★ Deep Dive¶

Why Merging Works¶

When you fine-tune a base model for different tasks, each resulting model occupies a point in weight space near the base model. Research shows that:

The loss along the straight line between models fine-tuned from the same base is often low (linear mode connectivity), so averaging weights tends to land in a good region
Most weight changes are small: only a small fraction of weights (often cited as under 5%) change significantly during fine-tuning
Different tasks tend to modify different weights, so interference is often lower than expected

WEIGHT SPACE VISUALIZATION:

         Code Expert ●
                    ╲
                     ╲  ← Merged model lands
                      ●    somewhere in this region
                     ╱
                    ╱
     Medical Expert ●
                    │
                    │
              Base Model ●

Key insight: The path between fine-tuned models
often passes through regions of LOW loss for BOTH tasks.
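The three observations above can be checked on a toy example. The sketch below uses plain Python lists in place of real weight tensors; the values, the `threshold`, and the helper names are made up for illustration and are not taken from any real model.

```python
import math

def task_vector(finetuned, base):
    """Delta between fine-tuned and base weights (lists of equal length)."""
    return [f - b for f, b in zip(finetuned, base)]

def cosine(u, v):
    """Cosine similarity between two flat vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def significant_fraction(delta, threshold):
    """Share of parameters whose change is at least `threshold` in magnitude."""
    return sum(abs(d) >= threshold for d in delta) / len(delta)

# Hypothetical weights: each expert moves a different coordinate of the base.
base    = [0.10, -0.20, 0.30, 0.40, -0.50, 0.60]
code    = [0.40, -0.20, 0.30, 0.41, -0.50, 0.60]
medical = [0.10, -0.20, 0.30, 0.40, -0.80, 0.61]

tv_code = task_vector(code, base)
tv_med = task_vector(medical, base)

overlap = cosine(tv_code, tv_med)                  # near 0: the deltas barely interact
frac = significant_fraction(tv_code, threshold=0.05)  # only 1 of 6 weights moved a lot
```

On real checkpoints the same quantities would be computed per tensor; a cosine similarity near zero between task vectors is one rough signal that two experts may merge with little interference.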

Core Techniques¶

Technique	Method	When to Use	Quality	Complexity
Model Soups	Simple weight averaging	Same base, same task, different hyperparams	Good	Low
SLERP	Spherical interpolation between 2 models	Blending 2 models with different strengths	Good	Low
TIES	Trim + Elect Sign + Merge	≥2 models with potential sign conflicts	Very Good	Medium
DARE	Drop + Rescale delta params	Sparsifying merges, combines with TIES	Very Good	Medium
Frankenmerging	Layer/block stacking from different models	Experimental, architecture exploration	Variable	High

Model Soups (Weight Averaging)¶

The simplest technique — average the weights of models fine-tuned from the same base.

UNIFORM SOUP:
  merged_weight = (model_A_weight + model_B_weight + model_C_weight) / 3

WEIGHTED SOUP:
  merged_weight = 0.5 * model_A_weight + 0.3 * model_B_weight + 0.2 * model_C_weight

Works best when:
  - All models share the same base (e.g., all fine-tuned from Llama 3.1 8B)
  - Models were trained on similar tasks but with different hyperparameters
  - You want a more robust model (reduces variance)

SLERP (Spherical Linear Interpolation)¶

Interpolates between two models along a spherical path rather than a straight line in weight space. Better preserves the magnitude of weight vectors.

LINEAR INTERPOLATION (naive):
  merged = (1 - t) * model_A + t * model_B
  Problem: Can shrink weight magnitudes at t = 0.5

SLERP:
  Ω = arccos(A · B / (|A| × |B|))
  merged = sin((1-t)Ω)/sin(Ω) × A + sin(tΩ)/sin(Ω) × B
  Advantage: Preserves weight vector norms (exactly when |A| = |B|)

  t = 0.0 → pure model A
  t = 0.5 → balanced blend
  t = 1.0 → pure model B

Limitation: Only works for EXACTLY 2 models.
For 3+ models, use TIES or DARE.
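A minimal sketch of the SLERP formula above, written for flat Python lists rather than real tensors. The function names and the `eps` fallback to linear interpolation for nearly parallel vectors are assumptions made for this illustration; merge tools typically apply the same idea per tensor.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def lerp(a, b, t):
    """Naive linear interpolation."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]

def slerp(a, b, t, eps=1e-8):
    """Spherical interpolation between flat weight vectors a and b."""
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding error
    omega = math.acos(cos_omega)
    if math.sin(omega) < eps:
        # Vectors are (anti)parallel: the formula divides by ~0, so fall back.
        return lerp(a, b, t)
    weight_a = math.sin((1 - t) * omega) / math.sin(omega)
    weight_b = math.sin(t * omega) / math.sin(omega)
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]

a = [1.0, 0.0]
b = [0.0, 1.0]
mid_lerp = lerp(a, b, 0.5)    # norm shrinks to about 0.707
mid_slerp = slerp(a, b, 0.5)  # norm stays at 1.0
```

With two orthogonal unit vectors, the linear midpoint loses about 29% of the norm while the spherical midpoint keeps it, which is the shrinkage problem described above.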

TIES (Trim, Elect Sign, Merge)¶

Designed to resolve parameter interference when merging multiple models.

THE INTERFERENCE PROBLEM:
  Model A says weight_42 should increase by +0.3
  Model B says weight_42 should decrease by -0.2
  Naive averaging: +0.05 (neither model gets what it wants!)

TIES SOLUTION (3 steps):

  STEP 1 (TRIM): Drop low-magnitude delta parameters
    Delta = fine_tuned_weight - base_weight
    Keep only the top-k% of each model's deltas by magnitude (the paper uses 20%); set the rest to 0
    Removes noise, keeps only significant changes

  STEP 2 (ELECT SIGN): Resolve sign conflicts by total magnitude
    For each parameter, sum the trimmed deltas across models
    → Elect the sign of that sum, zero out deltas with the opposite sign
    (a single large delta can outweigh several small ones; this is not a head count)

  STEP 3 (MERGE): Average the remaining (trimmed, sign-aligned) deltas
    merged = base_weight + avg(surviving_deltas)
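The three steps can be sketched on toy vectors as follows. This is an illustrative pure-Python version, not the reference implementation: `ties_merge` and `keep_fraction` are names chosen here, the sign is elected from the summed trimmed deltas as in the TIES paper, and the paper's final scaling factor is omitted for brevity.

```python
def ties_merge(base, finetuned_models, keep_fraction=0.2):
    """Toy TIES merge over flat lists of weights."""
    n = len(base)
    k = max(1, round(keep_fraction * n))

    # Step 1 (trim): keep the k largest-magnitude deltas per model.
    trimmed = []
    for weights in finetuned_models:
        delta = [w - b for w, b in zip(weights, base)]
        cutoff = sorted((abs(d) for d in delta), reverse=True)[k - 1]
        trimmed.append([d if abs(d) >= cutoff else 0.0 for d in delta])

    merged = []
    for i in range(n):
        column = [tv[i] for tv in trimmed]
        # Step 2 (elect sign): the sign of the summed deltas wins.
        sign = 1.0 if sum(column) >= 0 else -1.0
        # Step 3 (merge): average only the nonzero deltas that agree.
        agreeing = [d for d in column if d != 0.0 and d * sign > 0]
        delta = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(base[i] + delta)
    return merged

base = [0.0, 0.0, 0.0, 0.0]
model_a = [0.3, 0.01, 0.0, -0.5]   # wants weight 0 to increase by 0.3
model_b = [-0.2, 0.4, 0.02, 0.01]  # wants weight 0 to decrease by 0.2
merged = ties_merge(base, [model_a, model_b], keep_fraction=0.5)
```

For the conflicting first weight, the summed delta is positive, so the merged value keeps model A's +0.3 instead of the +0.05 that naive averaging would give.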

DARE (Drop and Rescale)¶

A sparsification technique that randomly drops most delta parameters and rescales the remainder. Often combined with TIES.

DARE MECHANISM:
  1. Compute delta: Δ = fine_tuned - base
  2. Randomly drop a fraction p of the delta parameters
     (the DARE paper reports that p = 0.9, and in some cases 0.99, still works)
  3. Rescale remaining: Δ_surviving = Δ_surviving / (1 - p)
     (Rescaling keeps the expected value of each delta unchanged)

WHY IT WORKS:
  Most fine-tuning deltas are redundant. Dropping 90% and rescaling
  the rest preserves performance while reducing interference when merging.

TYPICAL COMBO:
  DARE (drop_rate=0.9) + TIES = DARE-TIES
  → Best of both: sparsification + sign conflict resolution
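The drop-and-rescale mechanism can be illustrated with a few lines of standard-library Python. This is a sketch under simple assumptions: weights are flat lists, each parameter is dropped independently with probability `drop_rate`, and the function name `dare` and the `seed` argument are chosen here for reproducibility.

```python
import random

def dare(base, finetuned, drop_rate, seed=0):
    """Return the sparsified, rescaled delta for one fine-tuned model."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    rng = random.Random(seed)
    keep_scale = 1.0 / (1.0 - drop_rate)
    sparse_delta = []
    for w, b in zip(finetuned, base):
        if rng.random() < drop_rate:
            sparse_delta.append(0.0)                   # dropped
        else:
            sparse_delta.append((w - b) * keep_scale)  # kept and rescaled
    return sparse_delta

# Hypothetical model: every weight moved by +0.01 during fine-tuning.
base = [0.0] * 10_000
finetuned = [0.01] * 10_000
sparse = dare(base, finetuned, drop_rate=0.9)

kept = sum(1 for d in sparse if d != 0.0)  # roughly 10% of parameters survive
mean_delta = sum(sparse) / len(sparse)     # stays close to the original 0.01
```

Each surviving delta is ten times larger, so the average change across the tensor stays close to the original even though about 90% of the entries are zero.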

Frankenmerging (Layer Stacking)¶

Instead of averaging weights, select specific layers from different models.

MODEL A (good at reasoning):  [L0] [L1] [L2] [L3] [L4] [L5]
MODEL B (good at code):       [L0] [L1] [L2] [L3] [L4] [L5]

FRANKENMERGE:
  [A:L0] [A:L1] [B:L2] [B:L3] [A:L4] [A:L5]
  → Take reasoning layers from A, coding layers from B

CAUTION:
  - Highly experimental, results are unpredictable
  - Works because transformer layers are somewhat modular
  - Requires matching architectures (same hidden dims, heads, etc.)
  - Popularized by the "Goliath" 120B merge on HuggingFace, which stacks layer ranges from two 70B models into one deeper model
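A toy sketch of layer selection, using string labels in place of real layer weights. The `frankenmerge` function and the `(model, start, end)` plan format are hypothetical and chosen for this illustration; they are not mergekit's API. The second plan shows how overlapping ranges produce a model deeper than either parent.

```python
def frankenmerge(models, plan):
    """Stack layer slices from several models.

    models: dict mapping a model name to its list of layers
    plan:   list of (model_name, start, end) with end exclusive
    """
    stacked = []
    for name, start, end in plan:
        layers = models[name]
        if not 0 <= start < end <= len(layers):
            raise ValueError(f"invalid layer range {start}:{end} for {name}")
        stacked.extend(layers[start:end])
    return stacked

models = {
    "A": [f"A:L{i}" for i in range(6)],
    "B": [f"B:L{i}" for i in range(6)],
}

# Same depth as the parents: swap the middle block, as in the diagram above.
swapped = frankenmerge(models, [("A", 0, 2), ("B", 2, 4), ("A", 4, 6)])

# Overlapping ranges: 8 layers built from two 6-layer parents.
deeper = frankenmerge(models, [("A", 0, 4), ("B", 2, 6)])
```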

Decision Guide¶

WHEN TO MERGE vs OTHER APPROACHES:

  Have multiple fine-tuned models from same base?
    YES → Model merging is free; try DARE-TIES first
    NO  → Can you fine-tune from a common base? If no → Distillation

  Need guaranteed quality for production?
    YES → Fine-tune the base model on the combined data (safer but expensive)
    NO  → Merge is fine for experimentation

  Models fine-tuned on DIFFERENT tasks?
    YES → TIES or DARE-TIES (handles interference)
    NO  → Model Soups or SLERP (models are close in weight space)

  Only 2 models?
    YES → SLERP (best for 2-model blending)
    NO  → TIES or DARE-TIES (handles 3+ models)

  Want to explore exotic architectures?
    YES → Frankenmerge (experimental, high variance)
    NO  → Stick with weight-average methods

★ Code & Implementation¶

Merging with mergekit¶

# pip install mergekit
# ⚠️ Last tested: 2026-04 | Requires: mergekit>=0.0.5, torch>=2.0, GPU recommended

# === mergekit YAML config: DARE-TIES merge of 2 fine-tuned experts ===
# Save as: merge_config.yaml
# Note: each entry under `models` should be a full checkpoint. If your experts
# are LoRA adapters, merge each adapter into the base model first
# (for example with PEFT's merge_and_unload()) and point to the merged output.

merge_method: dare_ties
base_model: meta-llama/Llama-3.1-8B-Instruct  # shared base; deltas are computed against it
models:
  - model: ./models/code-expert-lora           # fine-tuned for code
    parameters:
      weight: 0.6   # code expert gets slightly more weight
      density: 0.5  # DARE: keep 50% of delta params (drop_rate=0.5)
  - model: ./models/medical-expert-lora         # fine-tuned for medical QA
    parameters:
      weight: 0.4
      density: 0.5
parameters:
  normalize: true      # normalize the per-model weights
dtype: bfloat16
tokenizer_source: base  # use base model tokenizer

# Run: mergekit-yaml merge_config.yaml ./merged-model --cuda
# Output: A single model at ./merged-model/ intended to handle both code and medical tasks

Simple Weight Averaging (Pure PyTorch)¶

# pip install "torch>=2.0" "safetensors>=0.4"
# ⚠️ Last tested: 2026-04 | Requires: torch>=2.0

from pathlib import Path
from typing import Optional

import torch
from safetensors.torch import load_file, save_file

def merge_models_soup(
    model_paths: list[str], weights: Optional[list[float]] = None
) -> dict[str, torch.Tensor]:
    """Merge multiple models via weighted averaging (Model Soups).

    Assumes each path is a single-file (unsharded) safetensors checkpoint
    and that all checkpoints share the same tensor names and shapes.

    Args:
        model_paths: Paths to safetensors model files
        weights: Per-model weights (default: uniform)

    Returns:
        Merged state dict
    """
    if not model_paths:
        raise ValueError("Need at least one model path")
    if weights is None:
        weights = [1.0 / len(model_paths)] * len(model_paths)
    if len(model_paths) != len(weights):
        raise ValueError("Must have one weight per model")
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("Weights must sum to 1.0")

    # Load first model as template; accumulate in float32 to limit rounding error
    merged = {
        key: tensor.float() * weights[0]
        for key, tensor in load_file(model_paths[0]).items()
    }

    # Accumulate weighted contributions from other models
    for path, weight in zip(model_paths[1:], weights[1:]):
        state_dict = load_file(path)
        if state_dict.keys() != merged.keys():
            raise ValueError(f"{path} has different tensor names than {model_paths[0]}")
        for key in merged:
            merged[key] += state_dict[key].float() * weight

    # Convert back to bfloat16 for efficient serving
    return {key: tensor.to(torch.bfloat16) for key, tensor in merged.items()}

# Usage
merged_state = merge_models_soup(
    model_paths=[
        "models/code-expert/model.safetensors",
        "models/medical-expert/model.safetensors",
        "models/safety-tuned/model.safetensors",
    ],
    weights=[0.4, 0.3, 0.3],  # code-heavy blend
)
Path("merged-model").mkdir(parents=True, exist_ok=True)  # save_file does not create directories
save_file(merged_state, "merged-model/model.safetensors")
print(f"Merged {len(merged_state)} tensors successfully")

◆ Quick Reference¶

TECHNIQUE CHEAT SHEET:

  2 models, simple blend         → SLERP (t=0.3-0.7)
  3+ models, same task           → Model Soups (uniform averaging)
  3+ models, different tasks     → TIES or DARE-TIES
  Experimental layer mixing      → Frankenmerge (⚠️ high variance)
  LoRA adapters specifically     → DARE-TIES via mergekit

MERGEKIT COMMANDS:
  mergekit-yaml config.yaml ./output --cuda      # GPU merge
  mergekit-yaml config.yaml ./output --lazy-unpickle       # experimental lazy loader (lower memory use)
  mergekit-yaml config.yaml ./output --out-shard-size 5B  # parameters per output shard

COMMON HYPERPARAMETERS:
  density (DARE):  0.3-0.7  (lower = more aggressive dropping)
  weight:          0.3-0.7  (relative importance per model)
  trim threshold:  top 20%  (TIES: keep only significant deltas)

◆ Production Failure Modes¶

Failure	Symptoms	Root Cause	Mitigation
Capability collapse	Merged model loses a skill one parent had	Weight interference — competing deltas cancel out	Use TIES/DARE instead of naive averaging; test each capability post-merge
Safety regression	Merged model bypasses safety training	Safety alignment weights diluted by task-specific merges	Always include safety-tuned model with high weight (≥0.3); red-team test post-merge
Evaluation blindspots	Merge scores well on average benchmarks but fails on specific tasks	Averaging hides per-task regressions	Evaluate on EACH parent's original eval set, not just aggregate benchmarks
Architecture mismatch	Merge crashes or produces garbage	Models have different architectures, vocab sizes, or tokenizers	Only merge models from the same base architecture and tokenizer
LoRA rank mismatch	mergekit fails or produces degraded output	Different LoRA rank/alpha across adapters	Standardize rank/alpha across all LoRA experiments you plan to merge

○ Gotchas & Common Mistakes¶

⚠️ Only merge from the same base: Merging LLaMA with Mistral produces garbage. Models MUST share the same architecture and initialization.
⚠️ Evaluate per-task, not just aggregate: A merge can improve average scores while catastrophically losing one parent's specialty.
⚠️ Tokenizer matters: Use the base model's tokenizer. If any fine-tuned model added special tokens, the merge will have mismatched embeddings.
⚠️ SLERP is for 2 models only: For 3+ models, use TIES or DARE. SLERP doesn't extend to multi-model merges.
⚠️ Merging ≠ knowledge addition: Merging combines what models already know. It cannot teach a merged model NEW knowledge that no parent had.
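Several of the gotchas above (mismatched bases, added special tokens) show up as differing tensor names or shapes, so a cheap pre-merge check can catch them. The sketch below compares two name-to-shape mappings; the function name and the parameter names in the example are hypothetical.

```python
def find_merge_conflicts(shapes_a, shapes_b):
    """Compare two {tensor_name: shape} mappings and list incompatibilities."""
    problems = []
    for name in sorted(set(shapes_a) | set(shapes_b)):
        if name not in shapes_a:
            problems.append(f"{name}: missing in first model")
        elif name not in shapes_b:
            problems.append(f"{name}: missing in second model")
        elif shapes_a[name] != shapes_b[name]:
            problems.append(f"{name}: shape {shapes_a[name]} vs {shapes_b[name]}")
    return problems

# Hypothetical case: the fine-tuned model added two special tokens.
base_shapes = {
    "embed_tokens.weight": (32000, 4096),
    "layers.0.q_proj.weight": (4096, 4096),
}
tuned_shapes = {
    "embed_tokens.weight": (32002, 4096),
    "layers.0.q_proj.weight": (4096, 4096),
}
problems = find_merge_conflicts(base_shapes, tuned_shapes)
```

For real checkpoints, the shape mappings could be built with something like `{k: tuple(v.shape) for k, v in load_file(path).items()}` using `safetensors.torch.load_file`. An empty result only means the tensors line up; it does not prove the models share a base.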

○ Interview Angles¶

Q: When would you use model merging instead of fine-tuning on combined data?
A: Merging when: (1) you already have multiple fine-tuned models and want to avoid retraining, (2) you don't have access to the original training data, (3) you want to quickly iterate on combinations (merging takes minutes, fine-tuning takes hours). Fine-tuning on combined data when: (1) you need guaranteed quality, (2) you have the data and compute budget, (3) the task combination is complex enough that weight averaging won't capture interactions.
Q: How do you evaluate whether a merge was successful?
A: Three levels. First, run each parent model's original eval suite against the merge; the merge should retain ≥90% of each parent's task-specific performance. Second, run general benchmarks (MMLU-Pro, HumanEval) to ensure no broad capability loss. Third, red-team for safety regressions, especially if one parent was a safety-tuned model. If any parent's capability drops below an acceptable threshold, adjust merge weights or switch to TIES/DARE to reduce interference.
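The first level of that answer, per-parent retention, is easy to automate. The sketch below assumes higher scores are better and that each parent's score on its own task is nonzero; the function name, task names, and scores are invented for illustration.

```python
def retention_report(parent_scores, merged_scores, min_retention=0.9):
    """Map each task to (retention ratio, passed?) for the merged model."""
    report = {}
    for task, parent_score in parent_scores.items():
        ratio = merged_scores[task] / parent_score
        report[task] = (ratio, ratio >= min_retention)
    return report

# Hypothetical eval results: each parent scored on its own task.
parent_scores = {"code": 0.60, "medical": 0.80}
merged_scores = {"code": 0.57, "medical": 0.68}

report = retention_report(parent_scores, merged_scores)
failed = [task for task, (_, ok) in report.items() if not ok]
```

Here the merge keeps about 95% of the code score but only 85% of the medical score, so `failed` flags the medical task as a regression to address by reweighting or changing the merge method.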

◆ Hands-On Exercises¶

Exercise 1: Merge Two LoRA Adapters with mergekit¶

Goal: Create a multi-task model from two LoRA experts
Time: 30 minutes
Steps:
1. Fine-tune two LoRA adapters from the same base (e.g., one for code, one for summarization), or download two from HuggingFace
2. Write a mergekit YAML config using the dare_ties method
3. Run mergekit-yaml config.yaml ./merged --cuda
4. Evaluate the merged model on both tasks
5. Compare: merged model vs each parent on their respective task
Expected Output: A single model that handles both tasks with ≤10% quality degradation per task

★ Connections¶

Relationship	Topics
Builds on	Fine-Tuning, Advanced Fine-Tuning, Distillation and Compression
Leads to	Inference Optimization (serving merged models), LLM Landscape (open-weight ecosystem)
Compare with	Ensemble methods (multiple models at inference), Multi-task fine-tuning (single training run)
Cross-domain	Federated learning (weight aggregation), Neural architecture search

★ Recommended Resources¶

Type	Resource	Why
🔧 Hands-on	mergekit	Industry-standard toolkit for all merge methods
📄 Paper	Yadav et al., "Resolving Interference When Merging Models" (2023)	The TIES paper — foundational reading
📄 Paper	Yu et al., "Language Models are Super Mario" (2023)	The DARE paper — drop and rescale technique
📄 Paper	Wortsman et al., "Model Soups" (2022)	Original work on weight averaging of fine-tuned models
🔧 Hands-on	HuggingFace Open LLM Leaderboard	See which top models are merges

★ Sources¶

Yadav et al., "Resolving Interference When Merging Models" (TIES, NeurIPS 2023)
Yu et al., "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch" (DARE, 2023)
Wortsman et al., "Model Soups: Averaging Weights of Multiple Fine-tuned Models Improves Accuracy" (ICML 2022)
mergekit documentation — https://github.com/arcee-ai/mergekit
Fine-Tuning
Advanced Fine-Tuning