ReSAE: Residualized Sparse Autoencoders for Multi-Layer Transformer Interventions

arXiv:2605.27819v1 Announce Type: cross Sparse autoencoders are usually trained one layer at a time, even though transformer residual stream activations are strongly coupled across depth. This creates a practical problem for multi-layer interventions: different layerwise dictionaries can spend capacity representing the same carried-forward information, and replacing several layers at once can produce interactions that are not predicted by single-layer behavior. We