Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift

arXiv cs.CL

arXiv:2605.21849v1 Announce Type: cross

Mechanistic interpretability aims to explain a model's behavior by identifying the internal structures that are causally responsible for it. Dictionary-based explainers such as sparse autoencoders and transcoders are a primary tool, but their faithfulness under out-of-distribution (OOD) shift has received little systematic attention. We show that distribution shift rotates the subspace that the model actively uses, which misaligns an explainer dictionary trained on in-distribution (ID) activations.
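To make the notion of a rotated active subspace concrete, the following is a minimal sketch on synthetic data, not the paper's method or measurement. It assumes the "actively used" subspace can be approximated by the top-k principal directions of the activations, and it uses principal angles as one standard way to quantify rotation between two subspaces. The helper names (`top_subspace`, `principal_angles`, `captured_variance`) and all numeric settings are hypothetical choices made for illustration.

```python
import numpy as np


def top_subspace(acts: np.ndarray, k: int) -> np.ndarray:
    """Orthonormal basis (d x k) for the top-k principal directions of acts (n x d)."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T


def principal_angles(basis_a: np.ndarray, basis_b: np.ndarray) -> np.ndarray:
    """Principal angles in radians (ascending) between two orthonormal bases."""
    cosines = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return np.arccos(np.clip(cosines, 0.0, 1.0))


def captured_variance(acts: np.ndarray, basis: np.ndarray) -> float:
    """Fraction of the centered activations' variance lying inside span(basis)."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    projected = centered @ basis
    return float((projected ** 2).sum() / (centered ** 2).sum())


# Synthetic setup (assumption): activations live in a 2-D subspace of a 16-D space.
rng = np.random.default_rng(0)
d, k, n = 16, 2, 2000
theta = np.pi / 6  # how far the simulated shift rotates one direction

eye = np.eye(d)
id_basis = eye[:, :k]
ood_basis = id_basis.copy()
# Rotate the second direction toward a coordinate that is unused in distribution.
ood_basis[:, 1] = np.cos(theta) * eye[:, 1] + np.sin(theta) * eye[:, 2]


def sample(basis: np.ndarray) -> np.ndarray:
    """Draw n activations from span(basis) plus small isotropic noise."""
    latent = rng.normal(size=(n, k)) * np.array([3.0, 2.0])
    return latent @ basis.T + 0.01 * rng.normal(size=(n, d))


id_acts = sample(id_basis)
ood_acts = sample(ood_basis)

id_sub = top_subspace(id_acts, k)    # stand-in for what an ID-trained dictionary spans
ood_sub = top_subspace(ood_acts, k)

angles = principal_angles(id_sub, ood_sub)
id_fit = captured_variance(id_acts, id_sub)
ood_fit = captured_variance(ood_acts, id_sub)

print("principal angles (rad):", angles)
print("variance captured by ID subspace, ID vs OOD:", id_fit, ood_fit)
```

In this toy setting the largest principal angle should land close to `theta`, and the ID subspace should capture less of the OOD variance than of the ID variance. That gap is a crude proxy for the misalignment described above; a real sparse autoencoder or transcoder dictionary is overcomplete and nonlinear, so this linear picture is only a rough analogy.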