Interpretability Without Tradeoffs: Disentangling Polysemanticity At Equal Predictive Performance

ArXi:2605.31304v1 Announce Type: new Deep neural networks (DNNs) are widely used, but interpreting what they actually learn remains difficult. A major obstacle is that individual neurons often encode multiple unrelated concepts, obscuring the decision process of the network. While prior work, such as sparse autoencoders, can separate these mixed signals into meaningful, "monosemantic" features, this typically requires altering the model in ways that can degrade downstream performance. To overcome this, we.