
Mechanistic Interpretability

Bit: We've built systems approaching AGI, and we have NO IDEA what's happening inside them. Mechanistic interpretability is reverse-engineering neural networks: finding the "circuits" that implement specific capabilities. It's neuroscience, but for artificial brains.


★ TL;DR

  • What: Understanding WHAT individual neurons, attention heads, and circuits in neural networks actually do — reverse-engineering the model's internal algorithms
  • Why: We can't trust what we can't understand. AI safety REQUIRES understanding model internals. Also: Anthropic's biggest research bet.
  • Key point: Models represent features in "superposition" — a single neuron encodes MANY concepts simultaneously. Sparse autoencoders can extract these hidden features.

★ Overview

Definition

Mechanistic interpretability (mech-interp) aims to understand neural networks by identifying the specific computations that neurons, attention heads, and circuits perform. It's different from "behavioral" interpretability (what the model does) — mech-interp focuses on HOW it does it internally.

Scope

Frontier research. This is not needed for building apps, but it demonstrates deep technical understanding. For practical safety/alignment, see Ethics Safety Alignment.

Significance

  • Anthropic's primary research direction (largest mech-interp team in the world)
  • OpenAI's Superalignment team worked on this before it was dissolved in 2024
  • Frontier AI labs argue this is essential for safe AGI
  • Demonstrates research depth in interviews at top AI labs

★ Deep Dive

Key Concepts

SUPERPOSITION:
  The biggest insight in mech-interp.

  Problem: A model with 768 neurons per layer represents far more than 768 concepts.
  How? SUPERPOSITION: multiple features share the same neurons.

  Analogy: Storing 1,000 songs on 100 CDs using compression.
  Each CD contributes a little to many songs.

  ┌─────────────────────────────────────────────┐
  │  Neuron 42 responds to:                     │
  │    "Python code"    (activation: 0.7)       │
  │    "Snakes"         (activation: 0.3)       │
  │    "British comedy" (activation: 0.1)       │
  │                                             │
  │  These features are SUPERPOSED in one neuron│
  │  The model uses directions in activation    │
  │  space, not individual neurons, to encode   │
  │  concepts.                                  │
  └─────────────────────────────────────────────┘
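
  A toy numerical sketch of the idea: squeeze 10 feature directions into a
  4-dimensional activation space and watch the interference (all numbers
  here are illustrative):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_features, d_neurons = 10, 4

# Each "feature" is a random unit-norm direction in a 4-dim activation space
directions = F.normalize(torch.randn(n_features, d_neurons), dim=-1)

# Activate features 0 and 3 with different strengths; the activation vector
# is simply their weighted sum (this is superposition)
activation = 0.7 * directions[0] + 0.3 * directions[3]

# Read out each feature with a dot product: the true features tend to score
# highest, but every other feature picks up nonzero interference
for i, score in enumerate(directions @ activation):
    print(f"feature {i}: {score:+.3f}")

  Typically features 0 and 3 dominate the readout, but nothing reads exactly
  zero. Real models get away with this because features are sparse, so the
  interference rarely matters.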


FEATURES:
  The actual concepts the model represents internally.
  Not neurons — features are DIRECTIONS in activation space.

  Example features found via sparse autoencoders:
  - "This text is in French"
  - "This number is a year"
  - "The Golden Gate Bridge" (famous Anthropic discovery)
  - "Code contains a bug"
  - "Deceptive behavior" (safety-critical!)


CIRCUITS:
  Connected patterns of features that implement specific behaviors.

  Example: "Indirect Object Identification" circuit
  "Mary gave the book to ___" → [John]

  Attention head A finds "Mary" (subject)
  Attention head B finds "John" (recipient)
  Attention head C copies "John" to output position

  Three heads working together = a circuit
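
  A minimal sketch of how you might start hunting for such heads with the
  TransformerLens library (assumes GPT-2 small; the heads this surfaces are
  a rough first pass, not the full published IOI circuit):

# pip install transformer-lens
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = "When Mary and John went to the store, Mary gave the book to"
tokens = model.to_tokens(prompt)
logits, cache = model.run_with_cache(tokens)

str_tokens = model.to_str_tokens(prompt)
john_pos = str_tokens.index(" John")
last_pos = tokens.shape[1] - 1

# Score each head by how much attention the final position pays to " John"
scores = []
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer]  # (batch, head, query_pos, key_pos)
    for head in range(model.cfg.n_heads):
        scores.append((pattern[0, head, last_pos, john_pos].item(), layer, head))

for score, layer, head in sorted(scores, reverse=True)[:5]:
    print(f"L{layer}H{head} attends to ' John' with weight {score:.2f}")

print("Top prediction:", repr(model.to_single_str_token(logits[0, -1].argmax().item())))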

Sparse Autoencoders (SAEs)

HOW TO EXTRACT FEATURES FROM SUPERPOSITION:

  Problem: Neurons are polysemantic (respond to many things).
  Solution: Train a sparse autoencoder to decompose activations.

  ┌─────────────────────────────────────────────┐
  │  Model activation (768-dim)                 │
  │         │                                   │
  │         ▼                                   │
  │  ┌───────────────┐                          │
  │  │ ENCODER       │  768 → 65,536 dims       │
  │  │ (expand)      │  (overcomplete)          │
  │  └──────┬────────┘                          │
  │         │  + sparsity constraint            │
  │         │  (only ~100 of 65K neurons active)│
  │         ▼                                   │
  │  ┌───────────────┐                          │
  │  │ SPARSE HIDDEN │  Most are ZERO           │
  │  │ FEATURES      │  Active ones =           │
  │  │               │  interpretable features  │
  │  └──────┬────────┘                          │
  │         ▼                                   │
  │  ┌───────────────┐                          │
  │  │ DECODER       │  65,536 → 768            │
  │  │ (reconstruct) │  Reconstruct original    │
  │  └───────────────┘                          │
  │                                             │
  │  Training: minimize reconstruction error    │
  │            + sparsity penalty               │
  └─────────────────────────────────────────────┘

  Result: Each of the 65K sparse neurons corresponds
  to ONE interpretable feature (ideally).

  Anthropic extracted millions of interpretable features from Claude 3 Sonnet
  this way (their largest SAE had ~34M feature dimensions)!
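
  A minimal PyTorch sketch of this training setup, with toy dimensions and
  random data standing in for real model activations (production SAEs add
  refinements such as decoder weight normalization):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder: expand, sparsify with ReLU + L1, reconstruct."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        f = F.relu(self.encoder(x))   # sparse feature activations (mostly zero)
        return self.decoder(f), f     # reconstruction + features

acts = torch.randn(1024, 768)         # stand-in for residual-stream activations
sae = SparseAutoencoder(d_model=768, d_hidden=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                       # strength of the sparsity penalty

for step in range(200):
    x_hat, f = sae(acts)
    loss = F.mse_loss(x_hat, acts) + l1_coeff * f.abs().sum(dim=-1).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(f"active features per example: {(f > 0).float().sum(-1).mean():.0f} / 4096")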

Research Techniques

Technique           | What It Does                                     | How
--------------------+--------------------------------------------------+----------------------------------------------
Activation patching | Test if a component is necessary for a behavior  | Replace its output, see if behavior changes
Probing             | Check if information exists in a layer           | Train a linear classifier on activations
Ablation            | Remove a component and measure impact            | Zero out neurons/heads, check output
Logit lens          | See what each layer "thinks" the next token is   | Project hidden states directly to vocabulary
Sparse autoencoders | Extract interpretable features                   | Overcomplete autoencoder with sparsity
Causal tracing      | Find where a fact is stored                      | Corrupt inputs, restore at each layer
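
As one example from the table, the logit lens takes only a few lines with
Hugging Face GPT-2. A minimal sketch, assuming GPT-2's final layer norm and
unembedding can decode intermediate residual streams (which works reasonably
well in practice):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states: (embeddings, after layer 1, ..., after layer 12)
for i, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h))  # decode with final LN + unembedding
    print(f"layer {i:2d} predicts: {tok.decode(logits[0, -1].argmax().item())!r}")

For prompts like this one, the correct token typically emerges only in the
later layers: the answer forms gradually through the network.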

Notable Discoveries

Discovery            | Who                | What
---------------------+--------------------+---------------------------------------------------------------------
Induction heads      | Anthropic (2022)   | Attention heads that implement in-context learning ("A B ... A → B")
Golden Gate Claude   | Anthropic (2024)   | Amplifying the "Golden Gate Bridge" feature made Claude obsessed with it
Millions of features | Anthropic (2024)   | Extracted millions of interpretable features (largest SAE: ~34M) from Claude 3 Sonnet
ROME                 | Meng et al. (2022) | Located and edited specific facts in GPT models
Deception features   | Anthropic (2024)   | Found features that activate when the model is being "deceptive"
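
A toy sketch in the spirit of Golden Gate Claude, using TransformerLens on
GPT-2. Everything here is an illustrative assumption: the layer, prompts,
and scale are arbitrary, and the "concept direction" is a crude difference
of mean activations rather than a real SAE feature:

# pip install transformer-lens
from functools import partial
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
HOOK = "blocks.6.hook_resid_post"      # a mid-network residual-stream hook

def mean_resid(text: str) -> torch.Tensor:
    _, cache = model.run_with_cache(model.to_tokens(text))
    return cache[HOOK][0].mean(dim=0)  # average over token positions

# Crude concept direction: bridge-flavored prompt minus a neutral prompt
with torch.no_grad():
    vec = mean_resid("The Golden Gate Bridge in San Francisco") \
        - mean_resid("The weather today is mild and calm")
    vec = vec / vec.norm()

def steer(resid, hook, direction, scale=8.0):
    return resid + scale * direction   # push the residual stream along the direction

with model.hooks(fwd_hooks=[(HOOK, partial(steer, direction=vec))]):
    print(model.generate("I went for a walk and saw", max_new_tokens=20, verbose=False))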

◆ Quick Reference

WHY IT MATTERS FOR SAFETY:
  1. Find deception: detect features that activate when
     the model is being strategically dishonest
  2. Understand capabilities: know what model CAN do vs DOES
  3. Controlled editing: surgically modify behavior
  4. Trust: "We can verify the model works as intended"

KEY TERMS:
  Monosemantic    = neuron responds to one concept
  Polysemantic    = neuron responds to many concepts
  Superposition   = features > neurons (compression)
  Circuit         = connected features implementing behavior
  SAE             = sparse autoencoder (feature extraction)
  Logit lens      = what each layer predicts
  Activation patching = causal intervention

RESEARCH GROUPS:
  Anthropic Interpretability team (largest)
  Google DeepMind (mech-interp group)
  EleutherAI (open-source research)
  Independent researchers (Neel Nanda, others)

○ Interview Angles

  • Q: What is superposition in neural networks?
  • A: Neural networks represent more concepts (features) than they have neurons. Features are encoded as DIRECTIONS in activation space, not individual neurons. Multiple features share the same neurons through superposition — similar to how compressed audio encodes many frequencies in fewer data points. Sparse autoencoders can decompose these back into individual features.

  • Q: Why does mechanistic interpretability matter for AI safety?
  • A: We need to understand what models are doing internally, not just what they output. Mech-interp can detect deceptive behavior (features that activate during strategic dishonesty), verify alignment (the model genuinely follows safety training, not just surface compliance), and enable targeted interventions (edit specific behaviors without retraining).

★ Code & Implementation

Integrated Gradients (Feature Importance for LLMs)

# pip install transformers>=4.40 torch>=2.3 captum>=0.7
# ⚠️ Last tested: 2026-04 | Requires: transformers>=4.40, captum>=0.7
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from captum.attr import LayerIntegratedGradients

model_id  = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model     = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

def predict(input_ids: torch.Tensor) -> torch.Tensor:
    return model(input_ids).logits

text   = "This movie is absolutely fantastic!"
inputs = tokenizer(text, return_tensors="pt")
ids    = inputs["input_ids"]          # (1, seq_len)
tokens = tokenizer.convert_ids_to_tokens(ids[0])

# Baseline: all-[PAD] tokens as a neutral reference (pad_token_id is 0 here)
baseline = torch.full_like(ids, tokenizer.pad_token_id)

# Token ids are integers, so gradients can't flow through them directly;
# attribute through the embedding layer with LayerIntegratedGradients.
lig = LayerIntegratedGradients(predict, model.distilbert.embeddings)
attrs, delta = lig.attribute(ids, baseline, target=1, return_convergence_delta=True)

# attrs: (1, seq_len, hidden); sum over the hidden dim for per-token scores
# (positive = contributes to the POSITIVE class), then normalize for display
importance = attrs.sum(dim=-1)[0].detach()
importance = (importance / importance.norm()).numpy()
print(f"Convergence delta: {delta.item():.4e}  (should be near 0)")
print("\nToken Attribution Scores:")
for token, score in sorted(zip(tokens, importance), key=lambda x: -abs(x[1])):
    bar = "█" * int(abs(score) * 30)
    sign = "+" if score > 0 else "-"
    print(f"  {token:<15} {sign}{abs(score):.4f}  {bar}")

★ Connections

Relationship | Topics
-------------+-----------------------------------------------------------------------
Builds on    | Transformers, Neural Networks, Linear Algebra for AI
Leads to     | AI safety, Ethics Safety Alignment, Trustworthy AI
Compare with | Behavioral evaluation (external), Explainable AI (XAI, surface-level)
Cross-domain | Neuroscience, Reverse engineering, Complex systems

◆ Production Failure Modes

Failure                   | Symptoms                                                   | Root Cause                                       | Mitigation
--------------------------+------------------------------------------------------------+--------------------------------------------------+------------------------------------------------------------------
Explanation infidelity    | Explanation doesn't match actual model reasoning           | Post-hoc explanations approximate, not exact     | Mechanistic interpretability, circuit-level analysis
Feature attribution noise | Saliency maps highlight irrelevant tokens                  | Gradient saturation, adversarial sensitivity     | Integrated gradients, multiple attribution methods, sanity checks
Interpretability theater  | Explanations satisfy auditors but don't reveal real risks  | Simple metrics used as a proxy for understanding | Causal interventions and ablation studies, not just correlation

◆ Hands-On Exercises

Exercise 1: Probe a Model's Internal Representations

Goal: Train a linear probe to detect what information a model encodes
Time: 45 minutes
Steps:
  1. Extract hidden states from a language model for 1,000 sentences
  2. Train a linear classifier to predict sentence properties (sentiment, topic, syntax)
  3. Compare probing accuracy at different layers
  4. Identify which layers encode which types of information
Expected Output: Layer-by-layer probing accuracy chart
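
A compact sketch of steps 1-3, assuming scikit-learn is available (the tiny
repeated dataset below is a stand-in for 1,000 real labeled sentences):

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased").eval()

sentences = ["I loved this film", "Utterly boring and slow"] * 50  # stand-in data
labels = [1, 0] * 50

enc = tok(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc, output_hidden_states=True).hidden_states

mask = enc["attention_mask"].unsqueeze(-1).float()
for layer, h in enumerate(hidden):
    pooled = (h * mask).sum(1) / mask.sum(1)   # mean over real (non-pad) tokens
    acc = cross_val_score(LogisticRegression(max_iter=1000),
                          pooled.numpy(), labels, cv=5).mean()
    print(f"layer {layer}: probe accuracy = {acc:.2f}")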


★ Resources

Type     | Resource                                    | Why
---------+---------------------------------------------+----------------------------------------------------------------
📄 Paper | Olah et al., "Zoom In" (2020)               | Beautiful interactive visualization of neural network features
📄 Paper | Anthropic, "Scaling Monosemanticity" (2024) | Extracting interpretable features from language models
🎥 Video | 3Blue1Brown, "Neural Networks"              | Visual intuition for neural network internals

★ Sources

  • Anthropic, "Scaling Monosemanticity" (2024) — https://transformer-circuits.pub/2024/scaling-monosemanticity/
  • Anthropic, "A Mathematical Framework for Transformer Circuits" (2021)
  • Neel Nanda, "200 Concrete Open Problems in Mechanistic Interpretability" (2022)
  • TransformerLens library — https://github.com/neelnanda-io/TransformerLens
  • Meng et al., "Locating and Editing Factual Associations in GPT" (ROME, 2022)