
Model Merging

Bit: Fine-tuning gives you specialists. Model merging gives you a generalist built from specialists — at zero extra training cost.


★ TL;DR

  • What: Techniques to combine the weights of multiple fine-tuned models into a single model without additional training
  • Why: Cheaper than training a multi-task model from scratch; lets you combine domain experts (code + medical + safety) into one deployable model
  • Key point: Merging is NOT ensembling — you get ONE model at inference time, with no extra latency or memory cost

★ Overview

Definition

Model merging encompasses weight-space techniques that combine parameters from two or more models — typically fine-tuned from the same base model — into a single unified model. Unlike ensembles (which run multiple models at inference), merged models have identical cost and latency to a single model.

Scope

Covers: TIES, DARE, SLERP, Model Soups, Frankenmerging, and the mergekit toolchain. For model size reduction, see Distillation and Compression. For weight modification via training, see Fine-Tuning and Advanced Fine-Tuning.

Significance

  • Standard open-weight technique in 2025-2026: many of the top entries on the HuggingFace Open LLM Leaderboard are merges
  • Cost: Zero GPU-hours for training — merging is a pure weight arithmetic operation
  • Production use cases: Multi-task deployment, domain adaptation stacking, safety alignment injection
  • Interview topic: "When would you merge models instead of fine-tuning or distilling?" is a common system design question

Prerequisites


★ Deep Dive

Why Merging Works

When you fine-tune a base model for different tasks, each resulting model occupies a point in weight space near the base model. Research shows that:

  1. Fine-tuned models are often linearly mode-connected — the loss along the straight path between them stays low, so averaged weights tend to land in a good region
  2. Most weight changes are small — fine-tuning significantly modifies only a small fraction (often <5%) of weights
  3. Different tasks modify different weights — interference is lower than you might expect

WEIGHT SPACE VISUALIZATION:

         Code Expert ●
                     ╲  ← Merged model lands
                      ●    somewhere in this region
     Medical Expert ●
              Base Model ●

Key insight: The path between fine-tuned models
often passes through regions of LOW loss for BOTH tasks.

Core Techniques

| Technique | Method | When to Use | Quality | Complexity |
|---|---|---|---|---|
| Model Soups | Simple weight averaging | Same base, same task, different hyperparams | Good | Low |
| SLERP | Spherical interpolation between 2 models | Blending 2 models with different strengths | Good | Low |
| TIES | Trim + Elect Sign + Merge | ≥2 models with potential sign conflicts | Very Good | Medium |
| DARE | Drop + Rescale delta params | Sparsifying merges; combines with TIES | Very Good | Medium |
| Frankenmerging | Layer/block stacking from different models | Experimental, architecture exploration | Variable | High |

Model Soups (Weight Averaging)

The simplest technique — average the weights of models fine-tuned from the same base.

UNIFORM SOUP:
  merged_weight = (model_A_weight + model_B_weight + model_C_weight) / 3

WEIGHTED SOUP:
  merged_weight = 0.5 * model_A_weight + 0.3 * model_B_weight + 0.2 * model_C_weight

Works best when:
  - All models share the same base (e.g., all fine-tuned from Llama 3.1 8B)
  - Models were trained on similar tasks but with different hyperparameters
  - You want a more robust model (reduces variance)

SLERP (Spherical Linear Interpolation)

Interpolates between two models along a spherical path rather than a straight line in weight space. Better preserves the magnitude of weight vectors.

LINEAR INTERPOLATION (naive):
  merged = (1 - t) * model_A + t * model_B
  Problem: Can shrink weight magnitudes at t = 0.5

SLERP:
  Ω = arccos(A · B / (|A| × |B|))
  merged = sin((1-t)Ω)/sin(Ω) × A + sin(tΩ)/sin(Ω) × B
  Advantage: Preserves weight vector norms

  t = 0.0 → pure model A
  t = 0.5 → balanced blend
  t = 1.0 → pure model B

Limitation: Only works for EXACTLY 2 models.
For 3+ models, use TIES or DARE.
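The SLERP formula above maps directly onto tensor operations. A minimal sketch in PyTorch, treating each weight tensor as one flattened vector; `slerp` here is an illustrative helper written against the formula in this section, not a mergekit API:

```python
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """Spherical interpolation between two weight tensors, flattened to vectors."""
    a_f, b_f = a.flatten().float(), b.flatten().float()
    # Angle Ω between the two weight vectors
    cos_omega = torch.dot(a_f, b_f) / (a_f.norm() * b_f.norm() + eps)
    omega = torch.arccos(cos_omega.clamp(-1.0, 1.0))
    if omega.abs() < 1e-4:
        # Nearly parallel vectors: SLERP degenerates, fall back to linear interpolation
        return (1 - t) * a + t * b
    sin_omega = torch.sin(omega)
    coef_a = torch.sin((1 - t) * omega) / sin_omega
    coef_b = torch.sin(t * omega) / sin_omega
    return (coef_a * a_f + coef_b * b_f).reshape(a.shape)
```

With t = 0 this returns model A's tensor, with t = 1 model B's; applied key by key over two state dicts it produces the SLERP merge.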

TIES (Trim, Elect Sign, Merge)

Designed to resolve parameter interference when merging multiple models.

THE INTERFERENCE PROBLEM:
  Model A says weight_42 should increase by +0.3
  Model B says weight_42 should decrease by -0.2
  Naive averaging: +0.05 (neither model gets what it wants!)

TIES SOLUTION (3 steps):

  STEP 1 — TRIM: Drop low-magnitude delta parameters
    Delta = fine_tuned_weight - base_weight
    If |delta| < threshold → set to 0
    Removes noise, keeps only significant changes

  STEP 2 — ELECT SIGN: Resolve sign conflicts by majority vote
    If 2 models say "increase" and 1 says "decrease"
    → Elect positive sign, zero out the negative delta

  STEP 3 — MERGE: Average the remaining (trimmed, sign-aligned) deltas
    merged = base_weight + avg(surviving_deltas)
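The three steps translate directly into tensor operations. A per-tensor sketch, assuming sign election by the sign of the summed deltas (a simplification of the paper's magnitude-weighted vote); `ties_merge` is an illustrative helper, not mergekit's API:

```python
import torch

def ties_merge(base: torch.Tensor, finetuned: list[torch.Tensor], k: float = 0.2) -> torch.Tensor:
    """Sketch of TIES: trim deltas, elect signs, average the survivors."""
    deltas = torch.stack([ft - base for ft in finetuned])  # (n_models, *shape)
    # STEP 1 - TRIM: keep only the top-k fraction of each model's deltas by magnitude
    flat = deltas.abs().flatten(1)
    n_keep = max(1, int(k * flat.shape[1]))
    thresh = flat.topk(n_keep, dim=1).values[:, -1].view(-1, *[1] * base.dim())
    deltas = torch.where(deltas.abs() >= thresh, deltas, torch.zeros_like(deltas))
    # STEP 2 - ELECT SIGN: per parameter, keep only deltas matching the majority sign
    elected = torch.sign(deltas.sum(dim=0))
    agree = torch.sign(deltas) == elected
    deltas = torch.where(agree, deltas, torch.zeros_like(deltas))
    # STEP 3 - MERGE: mean over the surviving (nonzero) deltas per parameter
    count = (deltas != 0).sum(dim=0).clamp(min=1)
    return base + deltas.sum(dim=0) / count
```

In the weight_42 example above, a +0.3 and a -0.2 delta would elect the positive sign, zero out the -0.2, and merge to +0.3 rather than a washed-out +0.05.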

DARE (Drop and Rescale)

A sparsification technique that randomly drops most delta parameters and rescales the remainder. Often combined with TIES.

DARE MECHANISM:
  1. Compute delta: Δ = fine_tuned - base
  2. Randomly drop a fraction p of delta parameters (typically p = 0.9-0.99)
  3. Rescale remaining: Δ_surviving = Δ_surviving / (1 - p)
     (Rescaling preserves the expected sum of deltas)

WHY IT WORKS:
  Most fine-tuning deltas are redundant. Dropping 90% and rescaling
  the rest preserves performance while reducing interference when merging.

TYPICAL COMBO:
  DARE (drop_rate=0.9) + TIES = DARE-TIES
  → Best of both: sparsification + sign conflict resolution

Frankenmerging (Layer Stacking)

Instead of averaging weights, select specific layers from different models.

MODEL A (good at reasoning):  [L0] [L1] [L2] [L3] [L4] [L5]
MODEL B (good at code):       [L0] [L1] [L2] [L3] [L4] [L5]

FRANKENMERGE:
  [A:L0] [A:L1] [B:L2] [B:L3] [A:L4] [A:L5]
  → Take reasoning layers from A, coding layers from B

CAUTION:
  - Highly experimental, results are unpredictable
  - Works because transformer layers are somewhat modular
  - Requires matching architectures (same hidden dims, heads, etc.)
  - Popularized by the "Goliath" 120B merge on HuggingFace
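In code, a frankenmerge is just key-wise selection over state dicts. A toy sketch assuming HuggingFace-style layer naming (`model.layers.<i>. ...`); real frankenmerges (e.g., via mergekit's passthrough method) can also duplicate layers, which this sketch does not:

```python
import torch

def frankenmerge(state_a: dict[str, torch.Tensor],
                 state_b: dict[str, torch.Tensor],
                 layers_from_b: set[int]) -> dict[str, torch.Tensor]:
    """Build a state dict taking the listed transformer layers from model B and
    everything else (embeddings, head, remaining layers) from model A."""
    merged = {}
    for key in state_a:
        source = state_a
        parts = key.split(".")
        if "layers" in parts:
            # Layer index follows the 'layers' token, e.g. model.layers.3.mlp.up_proj.weight
            idx = int(parts[parts.index("layers") + 1])
            if idx in layers_from_b:
                source = state_b
        merged[key] = source[key].clone()
    return merged
```

Both models must share the same architecture so every key exists in both state dicts with matching shapes.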

Decision Guide

WHEN TO MERGE vs OTHER APPROACHES:

  Have multiple fine-tuned models from same base?
    YES → Model merging is free — try DARE-TIES first
    NO  → Can you fine-tune from a common base? If no → Distillation

  Need guaranteed quality for production?
    YES → Fine-tune from scratch on combined data (safer but expensive)
    NO  → Merge is fine for experimentation

  Models fine-tuned on DIFFERENT tasks?
    YES → TIES or DARE-TIES (handles interference)
    NO  → Model Soups or SLERP (models are close in weight space)

  Only 2 models?
    YES → SLERP (best for 2-model blending)
    NO  → TIES or DARE-TIES (handles 3+ models)

  Want to explore exotic architectures?
    YES → Frankenmerge (experimental, high variance)
    NO  → Stick with weight-average methods

★ Code & Implementation

Merging with mergekit

# pip install mergekit
# ⚠️ Last tested: 2026-04 | Requires: mergekit>=0.0.5, torch>=2.0, GPU recommended

# === mergekit YAML config: DARE-TIES merge of 2 fine-tuned experts ===
# (LoRA adapters should first be merged into full checkpoints, e.g. with
#  peft's merge_and_unload, before being handed to mergekit.)
# Save as: merge_config.yaml

merge_method: dare_ties
base_model: meta-llama/Llama-3.1-8B-Instruct  # shared base; deltas are computed against it
models:
  - model: ./models/code-expert-lora          # fine-tuned for code
    parameters:
      weight: 0.6     # code expert gets slightly more weight
      density: 0.5    # DARE: keep 50% of delta params (drop_rate = 0.5)
  - model: ./models/medical-expert-lora       # fine-tuned for medical QA
    parameters:
      weight: 0.4
      density: 0.5
parameters:
  normalize: true     # rescale merge weights to sum to 1
dtype: bfloat16
tokenizer_source: base  # use the base model's tokenizer

# Run: mergekit-yaml merge_config.yaml ./merged-model --cuda
# Output: A single model at ./merged-model/ that handles both code and medical tasks

Simple Weight Averaging (Pure PyTorch)

# pip install torch>=2.0 safetensors>=0.4
# ⚠️ Last tested: 2026-04 | Requires: torch>=2.0

import torch
from safetensors.torch import load_file, save_file

def merge_models_soup(model_paths: list[str], weights: list[float] | None = None) -> dict:
    """Merge multiple models via weighted averaging (Model Soups).

    Args:
        model_paths: Paths to safetensors model files
        weights: Per-model weights (default: uniform)

    Returns:
        Merged state dict
    """
    if weights is None:
        weights = [1.0 / len(model_paths)] * len(model_paths)

    assert len(model_paths) == len(weights), "Must have one weight per model"
    assert abs(sum(weights) - 1.0) < 1e-6, "Weights must sum to 1.0"

    # Load first model as template
    merged = load_file(model_paths[0])
    for key in merged:
        merged[key] = merged[key].float() * weights[0]

    # Accumulate weighted contributions from other models
    for path, weight in zip(model_paths[1:], weights[1:]):
        state_dict = load_file(path)
        assert state_dict.keys() == merged.keys(), f"Key mismatch in {path}"
        for key in merged:
            merged[key] += state_dict[key].float() * weight

    # Convert back to bfloat16 for efficient serving
    for key in merged:
        merged[key] = merged[key].to(torch.bfloat16)

    return merged

# Usage
merged_state = merge_models_soup(
    model_paths=[
        "models/code-expert/model.safetensors",
        "models/medical-expert/model.safetensors",
        "models/safety-tuned/model.safetensors",
    ],
    weights=[0.4, 0.3, 0.3],  # code-heavy blend
)
save_file(merged_state, "merged-model/model.safetensors")
print(f"Merged {len(merged_state)} tensors successfully")

◆ Quick Reference

TECHNIQUE CHEAT SHEET:

  2 models, simple blend         → SLERP (t=0.3-0.7)
  3+ models, same task           → Model Soups (uniform averaging)
  3+ models, different tasks     → TIES or DARE-TIES
  Experimental layer mixing      → Frankenmerge (⚠️ high variance)
  LoRA adapters specifically     → DARE-TIES via mergekit

MERGEKIT COMMANDS:
  mergekit-yaml config.yaml ./output --cuda      # GPU merge
  mergekit-yaml config.yaml ./output --lazy-unpickle  # lower memory use
  mergekit-yaml config.yaml ./output --out-shard-size 4G  # control shard size

COMMON HYPERPARAMETERS:
  density (DARE):  0.3-0.7  (lower = more aggressive dropping)
  weight:          0.3-0.7  (relative importance per model)
  trim threshold:  top 20%  (TIES: keep only significant deltas)

◆ Production Failure Modes

| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Capability collapse | Merged model loses a skill one parent had | Weight interference — competing deltas cancel out | Use TIES/DARE instead of naive averaging; test each capability post-merge |
| Safety regression | Merged model bypasses safety training | Safety alignment weights diluted by task-specific merges | Always include a safety-tuned model with high weight (≥0.3); red-team test post-merge |
| Evaluation blind spots | Merge scores well on average benchmarks but fails on specific tasks | Averaging hides per-task regressions | Evaluate on EACH parent's original eval set, not just aggregate benchmarks |
| Architecture mismatch | Merge crashes or produces garbage | Models have different architectures, vocab sizes, or tokenizers | Only merge models from the same base architecture and tokenizer |
| LoRA rank mismatch | mergekit fails or produces degraded output | Different LoRA rank/alpha across adapters | Standardize rank/alpha across all LoRA experiments you plan to merge |

○ Gotchas & Common Mistakes

  • ⚠️ Only merge from the same base: Merging LLaMA with Mistral produces garbage. Models MUST share the same architecture and initialization.
  • ⚠️ Evaluate per-task, not just aggregate: A merge can improve average scores while catastrophically losing one parent's specialty.
  • ⚠️ Tokenizer matters: Use the base model's tokenizer. If any fine-tuned model added special tokens, the merge will have mismatched embeddings.
  • ⚠️ SLERP is for 2 models only: For 3+ models, use TIES or DARE. SLERP doesn't extend to multi-model merges.
  • ⚠️ Merging ≠ knowledge addition: Merging combines what models already know. It cannot teach a merged model NEW knowledge that no parent had.

○ Interview Angles

  • Q: When would you use model merging instead of fine-tuning on combined data?
  • A: Merging when: (1) you already have multiple fine-tuned models and want to avoid retraining, (2) you don't have access to the original training data, (3) you want to quickly iterate on combinations (merging takes minutes, fine-tuning takes hours). Fine-tuning on combined data when: (1) you need guaranteed quality, (2) you have the data and compute budget, (3) the task combination is complex enough that weight averaging won't capture interactions.

  • Q: How do you evaluate whether a merge was successful?

  • A: Three levels. First, run each parent model's original eval suite against the merge — the merge should retain ≥90% of each parent's task-specific performance. Second, run general benchmarks (MMLU-Pro, HumanEval) to ensure no broad capability loss. Third, red-team for safety regressions, especially if one parent was a safety-tuned model. If any parent's capability drops below acceptable threshold, adjust merge weights or switch to TIES/DARE to reduce interference.

◆ Hands-On Exercises

Exercise 1: Merge Two LoRA Adapters with mergekit

Goal: Create a multi-task model from two LoRA experts
Time: 30 minutes
Steps:
  1. Fine-tune two LoRA adapters from the same base (e.g., one for code, one for summarization) — or download two from HuggingFace
  2. Write a mergekit YAML config using the dare_ties method
  3. Run mergekit-yaml config.yaml ./merged --cuda
  4. Evaluate the merged model on both tasks
  5. Compare: merged model vs each parent on their respective task
Expected Output: A single model that handles both tasks with ≤10% quality degradation per task


★ Connections

| Relationship | Topics |
|---|---|
| Builds on | Fine-Tuning, Advanced Fine-Tuning, Distillation and Compression |
| Leads to | Inference Optimization (serving merged models), LLM Landscape (open-weight ecosystem) |
| Compare with | Ensemble methods (multiple models at inference), Multi-task fine-tuning (single training run) |
| Cross-domain | Federated learning (weight aggregation), Neural architecture search |

| Type | Resource | Why |
|---|---|---|
| 🔧 Hands-on | mergekit | Industry-standard toolkit for all merge methods |
| 📄 Paper | Yadav et al., "Resolving Interference When Merging Models" (2023) | The TIES paper — foundational reading |
| 📄 Paper | Yu et al., "Language Models are Super Mario" (2023) | The DARE paper — drop and rescale technique |
| 📄 Paper | Wortsman et al., "Model Soups" (2022) | Original work on weight averaging of fine-tuned models |
| 🔧 Hands-on | HuggingFace Open LLM Leaderboard | See which top models are merges |

★ Sources

  • Yadav et al., "Resolving Interference When Merging Models" (TIES, NeurIPS 2023)
  • Yu et al., "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch" (DARE, 2023)
  • Wortsman et al., "Model Soups: Averaging Weights of Multiple Fine-tuned Models Improves Accuracy" (ICML 2022)
  • mergekit documentation — https://github.com/arcee-ai/mergekit
  • Fine-Tuning
  • Advanced Fine-Tuning