Multilingual Steering by Design: Multilingual Sparse Autoencoders and Principled Layer Selection

arXiv:2605.23036v1 Announce Type: new

Sparse autoencoders (SAEs) enable feature-level mechanistic interpretability and activation steering in large language models (LLMs), but SAE-based language control remains unreliable in multilingual settings: most SAEs are trained on English-only data, and steering layers are chosen heuristically. We address these limitations by advancing a principled, mechanistic account of multilingual language steering with SAEs. First, we show that