Toward Identifiable Sparse Autoencoders

arXiv:2605.31245v1

Recently, sparse autoencoders (SAEs) have emerged as an attractive tool for interpreting and interacting with representations in practical neural networks. While it is common empirical folklore that SAEs are highly unstable, we also show this theoretically: different