✨ Bit: Transformers are O(n²) in sequence length. State Space Models (SSMs) are O(n). If SSMs can match transformer quality, they'd enable million-token contexts with linear cost. Mamba showed this is possible — and kicked off the biggest architectural debate since attention.
What: A family of sequence models based on continuous state-space representations that process sequences in linear time, offering an alternative to the quadratic attention mechanism in transformers.
Why: Transformer attention is O(n²) in sequence length. SSMs are O(n), enabling much longer sequences at lower cost. This makes them candidates for replacing or augmenting transformers.
Key point: Mamba (the leading SSM) achieves transformer-competitive quality with linear-time inference. Hybrid architectures (transformer + Mamba layers) are emerging as a practical middle ground.
A State Space Model (SSM) maps input sequences to output sequences through a hidden state that evolves according to a continuous dynamical system, which is then discretized for practical computation.
Covers: SSM mathematical foundations, the S4 → Mamba evolution, comparison with transformers, hybrid architectures. For transformer architecture, see Transformers. For modern architectures, see Modern Architectures.
CONTINUOUS STATE SPACE:
h'(t) = A h(t) + B x(t) ← How the hidden state evolves
y(t) = C h(t) + D x(t) ← How output is computed from state
Where:
x(t) = input signal at time t
h(t) = hidden state (memory of past inputs)
y(t) = output at time t
A = state transition matrix (how state evolves)
B = input projection (how input affects state)
C = output projection (how state produces output)
D = skip connection (direct input → output)
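To make these dynamics concrete, here is a minimal sketch that simulates the continuous system with a simple Euler step, h(t+Δ) ≈ h(t) + Δ·h'(t). The scalar coefficients are made up for illustration, not learned:

```python
# Euler simulation of h'(t) = A h(t) + B x(t), y(t) = C h(t) + D x(t)
# Scalar toy coefficients (illustrative, not learned)
A, B, C, D = -0.5, 1.0, 2.0, 0.1    # A < 0 means the state decays (stable system)
delta = 0.01                        # integration step size
h = 0.0
for k in range(1000):
    x = 1.0 if k < 500 else 0.0     # step input that switches off halfway through
    h = h + delta * (A * h + B * x) # h(t+Δ) ≈ h(t) + Δ·h'(t)
    y = C * h + D * x               # output readout
print(f"final h = {h:.4f}, y = {y:.4f}")  # state decays toward 0 once the input stops
```

The state acts as a leaky memory of past inputs: it charges up while x is on and decays once x turns off, with A controlling the decay rate.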
DISCRETIZATION:
For digital computation, we discretize with step size Δ:
h[k] = Ā h[k-1] + B̄ x[k]
y[k] = C h[k] (the D x[k] skip term is omitted here for simplicity)
Where Ā = exp(ΔA) and B̄ = (ΔA)⁻¹(exp(ΔA) - I)(ΔB), the zero-order-hold (ZOH) discretization
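These discrete matrices can be computed directly from A, B, and Δ. A minimal PyTorch sketch of the ZOH formulas above (A and B are random here purely for illustration):

```python
import torch

d_state, d_model, delta = 4, 2, 0.1
A = -torch.eye(d_state) + 0.01 * torch.randn(d_state, d_state)  # roughly stable transition
B = torch.randn(d_state, d_model)

dA = delta * A
A_bar = torch.matrix_exp(dA)  # Ā = exp(ΔA)
# B̄ = (ΔA)⁻¹ (exp(ΔA) - I) (ΔB), computed as a linear solve instead of an explicit inverse
B_bar = torch.linalg.solve(dA, (A_bar - torch.eye(d_state)) @ (delta * B))

# One discrete step: h[k] = Ā h[k-1] + B̄ x[k]
h = torch.zeros(d_state)
x_k = torch.randn(d_model)
h = A_bar @ h + B_bar @ x_k
print(h.shape)  # torch.Size([4])
```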
KEY MAMBA INSIGHT:
Make B, C, and Δ input-dependent (selective)
This allows the model to selectively remember or forget
information based on the current input — like a learned gate.
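A minimal sketch of what "input-dependent" means in practice (the projection layers and names here are illustrative, not the real Mamba code, which fuses these steps into an optimized kernel):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveParams(nn.Module):
    """Sketch: compute B, C, Δ from the input (the 'selective' part of Mamba)."""
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.to_B = nn.Linear(d_model, d_state)   # B now varies with x
        self.to_C = nn.Linear(d_model, d_state)   # C now varies with x
        self.to_delta = nn.Linear(d_model, 1)     # Δ now varies with x

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, d_model)
        B = self.to_B(x)                          # (batch, seq_len, d_state)
        C = self.to_C(x)                          # (batch, seq_len, d_state)
        delta = F.softplus(self.to_delta(x))      # Δ > 0; acts like a learned gate:
        # large Δ lets the current input overwrite the state,
        # small Δ carries the existing state forward almost unchanged
        return B, C, delta
```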
Q: Why are state space models interesting as an alternative to transformers?
A: The core motivation is computational complexity. Transformer attention is O(n²) in sequence length, making million-token contexts extremely expensive. SSMs like Mamba achieve O(n) — linear time — by processing sequences through a recurrent state that's updated at each step. The breakthrough in Mamba was making the state transition input-dependent (selective), allowing the model to learn what to remember and what to forget. In practice, pure SSMs still trail transformers slightly on tasks requiring precise recall of specific tokens, so hybrid architectures (mixing Mamba layers with attention layers) are emerging as the practical direction.
```python
# ⚠️ Last tested: 2026-04 | Requires: torch>=2.3
# Demonstrates the core SSM recurrence: h_t = A*h_{t-1} + B*x_t, y_t = C*h_t
# A toy analogue of the linear recurrence at the heart of S4/Mamba
import torch
import torch.nn as nn

class MinimalSSM(nn.Module):
    """Simplified 1D SSM layer demonstrating the recurrence."""

    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.d_state = d_state
        # State space matrices (learned)
        self.A = nn.Parameter(torch.randn(d_state, d_state) * 0.01)
        self.B = nn.Parameter(torch.randn(d_state, d_model) * 0.01)
        self.C = nn.Parameter(torch.randn(d_model, d_state) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        x: (batch, seq_len, d_model)
        Returns: y (batch, seq_len, d_model) via sequential recurrence
        Note: production Mamba uses a CUDA-optimized parallel scan, not this loop
        """
        batch, seq_len, _ = x.shape
        h = torch.zeros(batch, self.d_state, device=x.device)
        outputs = []
        for t in range(seq_len):
            # tanh keeps this toy numerically stable; real S4/Mamba keeps the
            # recurrence linear, which is what makes the parallel scan possible
            h = torch.tanh(h @ self.A.T + x[:, t, :] @ self.B.T)
            y_t = h @ self.C.T
            outputs.append(y_t)
        return torch.stack(outputs, dim=1)

# Test: O(n) in memory (no n² attention matrix)
ssm = MinimalSSM(d_model=64, d_state=16)
x = torch.randn(2, 1024, 64)  # batch=2, seq=1024, d=64
y = ssm(x)
print(f"Input: {x.shape}")   # (2, 1024, 64)
print(f"Output: {y.shape}")  # (2, 1024, 64) — same shape, O(n) memory
```
Exercise 1: Compare SSM vs Transformer on Long Sequences
Goal: Benchmark Mamba vs Transformer on sequence length scaling
Time: 30 minutes
Steps:
1. Run inference with a Transformer model at lengths 1K, 4K, 16K, 64K
2. Run inference with a Mamba model at the same lengths
3. Plot latency vs sequence length for both
4. Compare memory usage at each length
Expected Output: Latency and memory charts showing SSM's linear vs Transformer's quadratic scaling
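A starting point for these steps, reusing MinimalSSM from the listing above as a stand-in for Mamba (a sketch only; for realistic numbers use the optimized mamba_ssm package on a GPU, and note that the Python loop in MinimalSSM adds large constant overhead, so focus on how times grow with length rather than on absolute values):

```python
import time
import torch
import torch.nn as nn

d_model = 64
ssm = MinimalSSM(d_model=d_model, d_state=16)  # toy SSM from the listing above
txf = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)

for seq_len in (1024, 4096, 8192):             # extend toward 64K as memory allows
    x = torch.randn(1, seq_len, d_model)
    for name, model in (("ssm", ssm), ("transformer", txf)):
        model.eval()
        start = time.perf_counter()
        with torch.no_grad():
            model(x)
        print(f"{name:12s} @ {seq_len:6d}: {time.perf_counter() - start:.3f}s")

# On GPU, pair each run with torch.cuda.reset_peak_memory_stats() and
# torch.cuda.max_memory_allocated() to produce the memory chart in step 4.
```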