Mechanistic Interpretability¶
✨ Bit: We built AGI-approaching systems and we have NO IDEA what's happening inside them. Mechanistic interpretability is reverse-engineering neural networks — finding the "circuits" that implement specific capabilities. It's neuroscience, but for artificial brains.
★ TL;DR¶
- What: Understanding WHAT individual neurons, attention heads, and circuits in neural networks actually do — reverse-engineering the model's internal algorithms
- Why: We can't trust what we can't understand. AI safety REQUIRES understanding model internals. Also: Anthropic's biggest research bet.
- Key point: Models represent features in "superposition" — a single neuron encodes MANY concepts simultaneously. Sparse autoencoders can extract these hidden features.
★ Overview¶
Definition¶
Mechanistic interpretability (mech-interp) aims to understand neural networks by identifying the specific computations that neurons, attention heads, and circuits perform. It's different from "behavioral" interpretability (what the model does) — mech-interp focuses on HOW it does it internally.
Scope¶
Frontier research. Not required for building applications, but it demonstrates deep technical understanding. For practical safety/alignment work, see Ethics Safety Alignment.
Significance¶
- Anthropic's primary research direction (largest mech-interp team in the world)
- OpenAI's Superalignment team worked on this before it was dissolved in 2024
- Frontier AI labs argue this is essential for safe AGI
- Demonstrates research depth in interviews at top AI labs
★ Deep Dive¶
Key Concepts¶
SUPERPOSITION:
The biggest insight in mech-interp.
Problem: a layer with 768 neurons represents far more than 768 concepts.
How? SUPERPOSITION — multiple features share the same neurons.
Analogy: Storing 1,000 songs on 100 CDs using compression.
Each CD contributes a little to many songs.
┌─────────────────────────────────────────────┐
│ Neuron 42 responds to: │
│ "Python code" (activation: 0.7) │
│ "Snakes" (activation: 0.3) │
│ "British comedy" (activation: 0.1) │
│ │
│ These features are SUPERPOSED in one neuron│
│ The model uses directions in activation │
│ space, not individual neurons, to encode │
│ concepts. │
└─────────────────────────────────────────────┘
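A toy numeric sketch (an illustration of the idea, not taken from Anthropic's work): in a high-dimensional activation space, randomly chosen directions are nearly orthogonal, so far more features than neurons can coexist with only small interference.
# Toy superposition demo: pack many more "features" than dimensions as random
# near-orthogonal directions and measure how much they interfere with each other.
import torch

torch.manual_seed(0)
n_dims, n_features = 768, 10_000  # more concepts than neurons

# Each feature is a random unit-norm direction in activation space
directions = torch.nn.functional.normalize(torch.randn(n_features, n_dims), dim=1)

# Activate feature 0 only, then "read out" every feature with a dot product
activation = directions[0]
readout = directions @ activation  # (n_features,)

print(f"feature 0 readout:          {readout[0]:.3f}")               # ~1.0
print(f"max interference (others):  {readout[1:].abs().max():.3f}")  # small but nonzero
print(f"mean interference (others): {readout[1:].abs().mean():.3f}") # roughly 1/sqrt(n_dims)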
FEATURES:
The actual concepts the model represents internally.
Not neurons — features are DIRECTIONS in activation space.
Example features found via sparse autoencoders:
- "This text is in French"
- "This number is a year"
- "The Golden Gate Bridge" (famous Anthropic discovery)
- "Code contains a bug"
- "Deceptive behavior" (safety-critical!)
CIRCUITS:
Connected patterns of features that implement specific behaviors.
Example: "Indirect Object Identification" circuit
"Mary gave the book to ___" → [John]
Attention head A finds "Mary" (subject)
Attention head B finds "John" (recipient)
Attention head C copies "John" to output position
Three heads working together = a circuit
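An exploratory sketch using the TransformerLens library (listed under Sources): run GPT-2 small on the IOI prompt and print which heads attend from the final position back to " Mary". The threshold is arbitrary, and attention patterns alone do not prove a circuit; the real analysis relies on activation patching.
# pip install transformer_lens
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small, as in the IOI work
prompt = "When Mary and John went to the store, John gave the book to"

tokens = model.to_tokens(prompt)
logits, cache = model.run_with_cache(tokens)

# The model's actual prediction for the blank (should be " Mary")
next_id = logits[0, -1].argmax().item()
print("Predicted next token:", repr(model.tokenizer.decode(next_id)))

# For each layer, find the head whose final position attends most to " Mary"
str_tokens = model.to_str_tokens(prompt)
mary_pos = str_tokens.index(" Mary")
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer][0]     # (n_heads, query_pos, key_pos)
    attn_to_mary = pattern[:, -1, mary_pos]  # attention from final position to " Mary"
    head = attn_to_mary.argmax().item()
    if attn_to_mary[head] > 0.2:             # arbitrary display threshold
        print(f"L{layer}H{head} attends to ' Mary' with weight {attn_to_mary[head]:.2f}")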
Sparse Autoencoders (SAEs)¶
HOW TO EXTRACT FEATURES FROM SUPERPOSITION:
Problem: Neurons are polysemantic (respond to many things).
Solution: Train a sparse autoencoder to decompose activations.
┌────────────────────────────────────────────┐
│ Model activation (768-dim) │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ ENCODER │ 768 → 65,536 dimensions│
│ │ (expand) │ (overcomplete) │
│ └──────┬───────┘ │
│ │ + sparsity constraint │
│ │ (only ~100 of 65K neurons active)│
│ ▼ │
│ ┌──────────────┐ │
│ │ SPARSE HIDDEN │ Most are ZERO │
│ │ FEATURES │ Active ones = │
│ │ │ interpretable features │
│ └──────┬───────┘ │
│ ▼ │
│ ┌──────────────┐ │
│ │ DECODER │ 65,536 → 768 │
│ │ (reconstruct) │ Reconstruct original │
│ └──────────────┘ │
│ │
│ Training: minimize reconstruction error │
│ + sparsity penalty │
└────────────────────────────────────────────┘
Result: Each of the 65K sparse neurons corresponds
to ONE interpretable feature (ideally).
Anthropic extracted tens of millions of features from Claude 3 Sonnet (largest SAE: ~34M features)!
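A minimal sketch of the training loop above (random vectors stand in for real residual-stream activations; the sizes, optimizer, and L1 coefficient are illustrative, not Anthropic's production setup):
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with a ReLU code and an L1 sparsity penalty."""
    def __init__(self, d_model: int = 768, d_features: int = 65_536):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # expand (overcomplete)
        self.decoder = nn.Linear(d_features, d_model)  # reconstruct

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse, non-negative codes
        return self.decoder(features), features

sae = SparseAutoencoder(d_model=768, d_features=16_384)  # smaller than 65K to keep the demo light
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3  # sparsity strength (tunable)

acts = torch.randn(4096, 768)  # stand-in for real model activations
for step in range(200):
    batch = acts[torch.randint(0, len(acts), (256,))]
    recon, features = sae(batch)
    loss = (recon - batch).pow(2).mean() + l1_coeff * features.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

active = (features > 0).float().mean().item()
print(f"final loss = {loss.item():.4f}, fraction of features active = {active:.1%}")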
Research Techniques¶
| Technique | What It Does | How |
|---|---|---|
| Activation patching | Test if a component is necessary for a behavior | Replace its output, see if behavior changes |
| Probing | Check if information exists in a layer | Train a linear classifier on activations |
| Ablation | Remove a component and measure impact | Zero out neurons/heads, check output |
| Logit lens | See what each layer "thinks" the next token is | Project hidden states directly to vocabulary |
| Sparse autoencoders | Extract interpretable features | Overcomplete autoencoder with sparsity |
| Causal tracing | Find where a fact is stored | Corrupt inputs, restore at each layer |
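A minimal logit-lens sketch (one of the techniques in the table), assuming GPT-2 via Hugging Face transformers: project each layer's hidden state through the final LayerNorm and the unembedding matrix to see what that layer "thinks" the next token is.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "The Eiffel Tower is located in the city of"
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids, output_hidden_states=True)

# out.hidden_states: (n_layers + 1) tensors of shape (1, seq_len, 768)
for layer, h in enumerate(out.hidden_states):
    last = h[:, -1]
    if layer < len(out.hidden_states) - 1:
        last = model.transformer.ln_f(last)  # intermediate states still need the final LayerNorm
    top_id = model.lm_head(last).argmax(dim=-1)
    print(f"layer {layer:2d} → {tok.decode(top_id)!r}")
Early layers usually predict generic tokens; the intended completion tends to emerge only in the later layers, which is exactly what the logit lens is meant to show.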
Notable Discoveries¶
| Discovery | Who | What |
|---|---|---|
| Induction heads | Anthropic (2022) | Attention heads that implement in-context learning ("A B ... A → B") |
| Golden Gate Claude | Anthropic (2024) | Amplifying the "Golden Gate Bridge" feature made Claude obsessed with it |
| Millions of features | Anthropic (2024) | Extracted tens of millions of interpretable features from Claude 3 Sonnet via SAEs |
| ROME | Meng et al. (2022) | Located and edited specific facts in GPT models |
| Deception features | Anthropic (2024) | Found features that activate when the model is being "deceptive" |
◆ Quick Reference¶
WHY IT MATTERS FOR SAFETY:
1. Find deception: detect features that activate when the model is being strategically dishonest
2. Understand capabilities: know what model CAN do vs DOES
3. Controlled editing: surgically modify behavior
4. Trust: "We can verify the model works as intended"
KEY TERMS:
Monosemantic = neuron responds to one concept
Polysemantic = neuron responds to many concepts
Superposition = features > neurons (compression)
Circuit = connected features implementing behavior
SAE = sparse autoencoder (feature extraction)
Logit lens = what each layer predicts
Activation patching = causal intervention
RESEARCH GROUPS:
Anthropic Interpretability team (largest)
Google DeepMind (mech-interp group)
EleutherAI (open-source research)
Independent researchers (Neel Nanda, others)
○ Interview Angles¶
- Q: What is superposition in neural networks?
- A: Neural networks represent more concepts (features) than they have neurons. Features are encoded as DIRECTIONS in activation space, not individual neurons. Multiple features share the same neurons through superposition — similar to how compressed audio encodes many frequencies in fewer data points. Sparse autoencoders can decompose these back into individual features.
- Q: Why does mechanistic interpretability matter for AI safety?
- A: We need to understand what models are doing internally — not just what they output. Mech-interp can detect deceptive behavior (features that activate during strategic dishonesty), verify alignment (the model genuinely follows safety training, not just surface compliance), and enable targeted interventions (edit specific behaviors without retraining).
★ Code & Implementation¶
Integrated Gradients (Token-Level Feature Attribution)¶
# pip install "transformers>=4.40" "torch>=2.3" "captum>=0.7"
# ⚠️ Last tested: 2026-04 | Requires: transformers>=4.40, captum>=0.7
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from captum.attr import LayerIntegratedGradients

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

def predict(input_ids: torch.Tensor) -> torch.Tensor:
    return model(input_ids).logits

text = "This movie is absolutely fantastic!"
inputs = tokenizer(text, return_tensors="pt")
ids = inputs["input_ids"]  # (1, seq_len)
tokens = tokenizer.convert_ids_to_tokens(ids[0])

# Baseline: all-PAD tokens (neutral reference); for this tokenizer pad_token_id == 0
baseline = torch.full_like(ids, tokenizer.pad_token_id)

# Token IDs are discrete, so integrated gradients must run at the embedding layer,
# not on the raw ids; LayerIntegratedGradients interpolates at that layer.
lig = LayerIntegratedGradients(predict, model.get_input_embeddings())
attrs, delta = lig.attribute(ids, baseline, target=1, return_convergence_delta=True)

# attrs: (1, seq_len, emb_dim); sum over the embedding dim to get per-token scores.
# Positive score = contributes to the POSITIVE class (target=1).
importance = attrs.sum(dim=-1)[0].detach().numpy()
importance = importance / (abs(importance).max() + 1e-9)  # normalize for display

print(f"Convergence delta: {delta.item():.4e} (should be near 0)")
print("\nToken Attribution Scores:")
for token, score in sorted(zip(tokens, importance), key=lambda x: -abs(x[1])):
    bar = "█" * int(abs(score) * 30)
    sign = "+" if score > 0 else "-"
    print(f"  {token:<15} {sign}{abs(score):.4f} {bar}")
★ Connections¶
| Relationship | Topics |
|---|---|
| Builds on | Transformers, Neural Networks, Linear Algebra for AI |
| Leads to | AI safety, Ethics Safety Alignment, Trustworthy AI |
| Compare with | Behavioral evaluation (external), Explainable AI (XAI, surface-level) |
| Cross-domain | Neuroscience, Reverse engineering, Complex systems |
◆ Production Failure Modes¶
| Failure | Symptoms | Root Cause | Mitigation |
|---|---|---|---|
| Explanation infidelity | Explanation doesn't match actual model reasoning | Post-hoc explanations approximate, not exact | Mechanistic interpretability, circuit-level analysis |
| Feature attribution noise | Saliency maps highlight irrelevant tokens | Gradient saturation, adversarial sensitivity | Integrated gradients, multiple attribution methods, sanity checks |
| Interpretability theater | Explanations satisfy auditors but don't reveal real risks | Using simple metrics as proxy for understanding | Causal interventions, ablation studies, not just correlation |
◆ Hands-On Exercises¶
Exercise 1: Probe a Model's Internal Representations¶
Goal: Train a linear probe to detect what information a model encodes
Time: 45 minutes
Steps:
1. Extract hidden states from a language model for 1,000 sentences
2. Train a linear classifier to predict sentence properties (sentiment, topic, syntax)
3. Compare probing accuracy at different layers
4. Identify which layers encode which types of information
Expected Output: Layer-by-layer probing accuracy chart (a starter sketch follows below)
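Starter sketch for this exercise, with placeholder data and an assumed model (swap in your own 1,000 labelled sentences and properties of interest):
# pip install transformers torch scikit-learn
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased").eval()

# Placeholder dataset: replace with ~1,000 real sentences and labels
texts = ["I loved this film", "Terrible, boring movie",
         "An absolute delight", "Waste of two hours"] * 50
labels = [1, 0, 1, 0] * 50

with torch.no_grad():
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    out = model(**batch, output_hidden_states=True)

# out.hidden_states: one (batch, seq_len, 768) tensor per layer (plus embeddings)
for layer, h in enumerate(out.hidden_states):
    X = h[:, 0].numpy()  # first-token ([CLS] position) vector per sentence
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"layer {layer:2d}: probe accuracy = {probe.score(X_te, y_te):.2f}")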
★ Recommended Resources¶
| Type | Resource | Why |
|---|---|---|
| 📄 Paper | Olah et al. "Zoom In" (2020) | Beautiful interactive visualization of neural network features |
| 📄 Paper | Anthropic — "Scaling Monosemanticity" (2024) | Extracting interpretable features from language models |
| 🎥 Video | 3Blue1Brown — "Neural Networks" | Visual intuition for neural network internals |
★ Sources¶
- Anthropic, "Scaling Monosemanticity" (2024) — https://transformer-circuits.pub/2024/scaling-monosemanticity/
- Anthropic, "A Mathematical Framework for Transformer Circuits" (2021)
- Neel Nanda, "200 Concrete Open Problems in Mechanistic Interpretability" (2022)
- TransformerLens library — https://github.com/neelnanda-io/TransformerLens
- Meng et al., "Locating and Editing Factual Associations in GPT" (ROME, 2022)