PolySAE: Modeling Feature Interactions in Sparse Autoencoders via Polynomial Decoding

arXiv:2602.01322v2. Sparse autoencoders (SAEs) interpret neural network representations by decomposing activations into sparse combinations of dictionary atoms. However, SAEs assume that features combine additively through linear reconstruction, an assumption that cannot capture compositional structure: a linear model cannot distinguish whether "Starbucks" arises from the composition of "star" and "coffee" features or merely from their co-occurrence.
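
The additivity limitation can be illustrated with a small NumPy sketch. Everything in it is an assumption made for illustration: the dictionary `W`, the atom indices standing in for "star" and "coffee", and the pairwise term `V` are hypothetical, and the second-order decoder is a generic toy interaction term, not necessarily the architecture PolySAE uses. Under a linear decoder, the reconstruction with both features active is exactly the sum of the individual contributions, so composition and co-occurrence look identical. Adding a term that depends on the product of two activations gives the decoder something that appears only when both features are active together.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_atoms = 8, 4

# Hypothetical dictionary: one column per atom (feature direction).
W = rng.normal(size=(d_model, n_atoms))
b = np.zeros(d_model)


def linear_decode(z):
    """Standard SAE reconstruction: an additive mix of dictionary atoms."""
    return W @ z + b


# Sparse codes: atom 0 stands in for "star", atom 1 for "coffee" (illustrative only).
z_star = np.array([1.0, 0.0, 0.0, 0.0])
z_coffee = np.array([0.0, 1.0, 0.0, 0.0])
z_both = z_star + z_coffee

# Additivity: decoding both features equals the sum of the individual decodings
# (the bias is subtracted once so that it is not counted twice).
additive_gap = np.linalg.norm(
    linear_decode(z_both) - (linear_decode(z_star) + linear_decode(z_coffee) - b)
)

# Toy interaction term (an assumption, not the paper's method):
# one extra direction per unordered atom pair, scaled by z_i * z_j.
pairs = [(i, j) for i in range(n_atoms) for j in range(i + 1, n_atoms)]
V = rng.normal(size=(d_model, len(pairs)))


def quadratic_decode(z):
    """Linear reconstruction plus a second-order pairwise interaction term."""
    interactions = np.array([z[i] * z[j] for i, j in pairs])
    return linear_decode(z) + V @ interactions


interaction_gap = np.linalg.norm(
    quadratic_decode(z_both)
    - (quadratic_decode(z_star) + quadratic_decode(z_coffee) - b)
)

print(f"linear decoder gap:    {additive_gap:.3e}")
print(f"quadratic decoder gap: {interaction_gap:.3e}")
```

In this sketch the linear gap is zero up to floating-point error, while the quadratic gap equals the norm of the direction assigned to the pair (0, 1), which is the part of the reconstruction that neither feature produces alone.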