Mechanistic Interpretability as Statistical Estimation: A Variance Analysis

arXiv:2510.00845v4 Announce Type: replace-cross Mechanistic Interpretability (MI) aims to reverse-engineer model behaviors by identifying functional sub-networks. Yet, the scientific validity of these findings depends on their stability. In this work, we argue that circuit discovery is not a standalone task but a statistical estimation problem built upon causal mediation analysis