
State Space Models

Bit: Transformers are O(n²) in sequence length. State Space Models (SSMs) are O(n). If SSMs can match transformer quality, they'd enable million-token contexts with linear cost. Mamba showed this is possible — and kicked off the biggest architectural debate since attention.


★ TL;DR

  • What: A family of sequence models based on continuous state-space representations that process sequences in linear time, offering an alternative to the quadratic attention mechanism in transformers
  • Why: Transformer attention is O(n²) in sequence length. SSMs are O(n), enabling much longer sequences at lower cost. This makes them candidates for replacing or augmenting transformers.
  • Key point: Mamba (the leading SSM) achieves transformer-competitive quality with linear-time inference. Hybrid architectures (transformer + Mamba layers) are emerging as a practical middle ground.

★ Overview

Definition

A State Space Model (SSM) maps input sequences to output sequences through a hidden state that evolves according to a continuous dynamical system, which is then discretized for practical computation.

Scope

Covers: SSM mathematical foundations, the S4 → Mamba evolution, comparison with transformers, hybrid architectures. For transformer architecture, see Transformers. For modern architectures, see Modern Architectures.

Prerequisites


★ Deep Dive

The Computational Complexity Problem

SEQUENCE LENGTH vs COMPUTATION COST:

  Attention (Transformer):  O(n²) per layer
    n = 1K tokens  →  1,000,000 operations
    n = 10K tokens →  100,000,000 operations
    n = 100K tokens → 10,000,000,000 operations  ← expensive!
    n = 1M tokens  →  1,000,000,000,000 operations  ← infeasible!

  SSM (Mamba):              O(n) per layer
    n = 1K tokens  →  1,000 operations
    n = 10K tokens →  10,000 operations
    n = 100K tokens → 100,000 operations  ← cheap!
    n = 1M tokens  →  1,000,000 operations  ← still cheap!

  BUT: Linear-time doesn't help if quality is worse.
  Mamba's key insight: selective state spaces make O(n) quality-competitive.
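
As a rough sanity check on these numbers, the memory story tracks the compute story: attention works with an n × n score matrix per head (even kernels that avoid materializing it still stream through that many entries), while an SSM carries only a fixed-size state between steps. The snippet below is a back-of-the-envelope illustration rather than a benchmark; the float32 assumption and the example dimensions (d_state = 16, d_model = 4096) are arbitrary.

# Back-of-the-envelope memory comparison (illustrative numbers only)
def attn_scores_bytes(n: int, bytes_per_elem: int = 4) -> int:
    """Size of one n x n attention score matrix (single head, float32)."""
    return n * n * bytes_per_elem

def ssm_state_bytes(d_state: int, d_model: int, bytes_per_elem: int = 4) -> int:
    """Size of the recurrent state carried between steps."""
    return d_state * d_model * bytes_per_elem

for n in (1_000, 10_000, 100_000, 1_000_000):
    attn_gb = attn_scores_bytes(n) / 1e9
    ssm_mb = ssm_state_bytes(16, 4096) / 1e6
    print(f"n={n:>9,}  attention scores ~ {attn_gb:>10,.3f} GB   SSM state ~ {ssm_mb:.2f} MB")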

SSM Mathematics (Simplified)

CONTINUOUS STATE SPACE:
  h'(t) = A h(t) + B x(t)     ← How the hidden state evolves
  y(t)  = C h(t) + D x(t)     ← How output is computed from state

  Where:
    x(t) = input signal at time t
    h(t) = hidden state (memory of past inputs)
    y(t) = output at time t
    A = state transition matrix (how state evolves)
    B = input projection (how input affects state)
    C = output projection (how state produces output)
    D = skip connection (direct input → output)

DISCRETIZATION:
  For digital computation, we discretize with step size Δ:
  h[k] = Ā h[k-1] + B̄ x[k]
  y[k] = C h[k]

  Where Ā = exp(ΔA), B̄ = (ΔA)⁻¹(exp(ΔA) - I)(ΔB)
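
A minimal PyTorch sketch of this zero-order-hold discretization, assuming a small dense A with ΔA invertible (practical SSMs such as S4 and Mamba use structured or diagonal A and never form these matrices explicitly):

import torch

def discretize(A: torch.Tensor, B: torch.Tensor, delta: float):
    """Zero-order-hold discretization. A: (d_state, d_state), B: (d_state, d_model)."""
    dA = delta * A
    A_bar = torch.linalg.matrix_exp(dA)                    # Ā = exp(ΔA)
    eye = torch.eye(A.shape[0], dtype=A.dtype)
    # B̄ = (ΔA)^(-1) (exp(ΔA) - I) (ΔB), computed with a solve rather than an explicit inverse
    B_bar = torch.linalg.solve(dA, (A_bar - eye) @ (delta * B))
    return A_bar, B_bar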

KEY MAMBA INSIGHT:
  Make B, C, and Δ input-dependent (selective)
  This allows the model to selectively remember or forget
  information based on the current input — like a learned gate.
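
A toy version of that selection mechanism for a single channel with a diagonal A is sketched below; the names, shapes, and the scalar input signal are purely illustrative, while the real Mamba layer keeps a separate state per model dimension and fuses everything into a hardware-aware parallel scan.

import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, d_state = 64, 16
to_delta = nn.Linear(d_model, 1)        # Δ_t computed from the current token
to_B     = nn.Linear(d_model, d_state)  # B_t computed from the current token
to_C     = nn.Linear(d_model, d_state)  # C_t computed from the current token
A = -torch.rand(d_state)                # fixed negative diagonal A (decay rates)

x = torch.randn(2, 10, d_model)         # (batch, seq_len, d_model)
u = x.mean(dim=-1)                      # toy scalar input signal per token
h = torch.zeros(2, d_state)             # running state for this one channel

for t in range(x.shape[1]):
    delta = F.softplus(to_delta(x[:, t]))   # (batch, 1), always positive
    A_bar = torch.exp(delta * A)            # per-token decay: small Δ keeps the state,
                                            # large Δ flushes it in favor of the current input
    B_bar = delta * to_B(x[:, t])           # (batch, d_state)
    h = A_bar * h + B_bar * u[:, t:t+1]     # selective state update
    y_t = (to_C(x[:, t]) * h).sum(-1)       # (batch,) output for this channel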

Evolution: S4 → Mamba → Hybrid

Model | Year | Innovation | Quality vs Transformer
S4 | 2021 | Structured state spaces, efficient long-range modeling | Below
H3 | 2022 | Added gating, closer to transformer quality | Approaching
Mamba | 2023 | Selective state spaces (input-dependent parameters) | Competitive
Mamba-2 | 2024 | Structured state space duality with attention | Competitive+
Jamba (AI21) | 2024 | Hybrid: Mamba + attention layers | Competitive
Zamba | 2024 | Efficient hybrid architecture | Competitive

Transformer vs SSM: Where Each Wins

Aspect | Transformer | SSM (Mamba)
Short sequences (< 4K) | ✅ Strong | Good
Long sequences (> 32K) | Expensive | ✅ Efficient
In-context learning | ✅ Excellent | Good
Inference speed | Slow (KV-cache grows) | ✅ Fast (constant state)
Training parallelism | ✅ Highly parallel | ✅ Parallel (convolution mode)
Recall of specific details | ✅ Strong (attention) | Weaker (compressed state)
Ecosystem / tooling | ✅ Mature | Early

◆ Production Failure Modes

Failure | Symptoms | Root Cause | Mitigation
Recall failure | Model forgets specific details from early in long context | SSM compresses state; specific details can be lost | Use hybrid architecture (attention layers for recall)
Tooling gaps | Can't use standard inference frameworks | vLLM/TGI optimized for transformers, not SSMs | Use model-specific serving code, or hybrid models
Training instability | Loss spikes during training | SSM parameter initialization is sensitive | Use Mamba's recommended initialization and a careful learning rate

○ Interview Angles

  • Q: Why are state space models interesting as an alternative to transformers?
  • A: The core motivation is computational complexity. Transformer attention is O(n²) in sequence length, making million-token contexts extremely expensive. SSMs like Mamba achieve O(n) — linear time — by processing sequences through a recurrent state that's updated at each step. The breakthrough in Mamba was making the state transition input-dependent (selective), allowing the model to learn what to remember and what to forget. In practice, pure SSMs still trail transformers slightly on tasks requiring precise recall of specific tokens, so hybrid architectures (mixing Mamba layers with attention layers) are emerging as the practical direction.

★ Code & Implementation

Minimal SSM Recurrence (Discrete S4/Mamba Pattern)

# ⚠️ Last tested: 2026-04 | Requires: torch>=2.3
# Demonstrates the core SSM recurrence: h_t = A*h_{t-1} + B*x_t, y_t = C*h_t
# This is the linear recurrence at the heart of S4/Mamba

import torch
import torch.nn as nn

class MinimalSSM(nn.Module):
    """Simplified 1D SSM layer demonstrating the recurrence."""
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.d_state = d_state
        # State space matrices (learned)
        self.A = nn.Parameter(torch.randn(d_state, d_state) * 0.01)
        self.B = nn.Parameter(torch.randn(d_state, d_model) * 0.01)
        self.C = nn.Parameter(torch.randn(d_model, d_state) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        x: (batch, seq_len, d_model)
        Returns: y (batch, seq_len, d_model) via sequential recurrence
        Note: production Mamba uses CUDA-optimized parallel scan, not this loop
        """
        batch, seq_len, _ = x.shape
        h = torch.zeros(batch, self.d_state, device=x.device)
        outputs = []
        for t in range(seq_len):
            h = h @ self.A.T + x[:, t, :] @ self.B.T  # linear update (no nonlinearity): h_t = A h_{t-1} + B x_t
            y_t = h @ self.C.T
            outputs.append(y_t)
        return torch.stack(outputs, dim=1)

# Test: O(n) in memory (no n² attention matrix)
ssm = MinimalSSM(d_model=64, d_state=16)
x   = torch.randn(2, 1024, 64)  # batch=2, seq=1024, d=64
y   = ssm(x)
print(f"Input:  {x.shape}")  # (2, 1024, 64)
print(f"Output: {y.shape}")  # (2, 1024, 64) — same shape, O(n) memory

★ Connections

Relationship | Topics
Builds on | Transformers, Attention Mechanism
Leads to | Efficient long-context models, hybrid architectures, next-gen sequence modeling
Compare with | Transformer attention, linear attention, RNNs
Cross-domain | Control theory, signal processing, dynamical systems

Type | Resource | Why
📄 Paper | Gu & Dao "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" (2023) | The foundational Mamba paper
📄 Paper | Gu et al. "Efficiently Modeling Long Sequences with Structured State Spaces" (S4, 2021) | The S4 paper that started the SSM revolution
🎥 Video | Yannic Kilcher, "Mamba Explained" | Detailed paper walkthrough
📄 Paper | Dao & Gu "Transformers are SSMs" (Mamba-2, 2024) | Unifying SSMs and attention theoretically

◆ Hands-On Exercises

Exercise 1: Compare SSM vs Transformer on Long Sequences

Goal: Benchmark Mamba vs Transformer on sequence-length scaling
Time: 30 minutes
Steps:
  1. Run inference with a Transformer model at lengths 1K, 4K, 16K, 64K
  2. Run inference with a Mamba model at the same lengths
  3. Plot latency vs sequence length for both
  4. Compare memory usage at each length
Expected Output: Latency and memory charts showing SSM's linear vs Transformer's quadratic scaling (a minimal timing-harness sketch follows below)
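
A minimal timing-harness sketch for steps 1-3. Model loading is deliberately left out: load_transformer_model and load_mamba_model are hypothetical placeholders for whichever checkpoints you choose, and the harness only assumes the model is a callable that accepts a (batch, seq_len) token tensor.

import time
import torch

def time_forward(model, seq_len: int, vocab_size: int = 32_000, n_runs: int = 3) -> float:
    """Median wall-clock seconds for one forward pass at the given sequence length."""
    tokens = torch.randint(0, vocab_size, (1, seq_len))
    if torch.cuda.is_available():
        model = model.cuda()
        tokens = tokens.cuda()
    timings = []
    with torch.no_grad():
        for _ in range(n_runs):
            if torch.cuda.is_available():
                torch.cuda.synchronize()
            start = time.perf_counter()
            model(tokens)
            if torch.cuda.is_available():
                torch.cuda.synchronize()
            timings.append(time.perf_counter() - start)
    return sorted(timings)[len(timings) // 2]

# Hypothetical loaders -- replace with your own checkpoints.
# transformer = load_transformer_model()
# mamba       = load_mamba_model()
# for n in (1_024, 4_096, 16_384, 65_536):
#     print(n, time_forward(transformer, n), time_forward(mamba, n))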


★ Sources

  • Gu & Dao "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" (2023)
  • Gu et al. "Efficiently Modeling Long Sequences with Structured State Spaces" (2021)
  • Dao & Gu "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality" (2024)
  • Modern Architectures