Latent Terms: Dense Retrievers Contain Trivially Extractable BM25-ready Zipfian Vocabularies

arXiv:2605.29384v1 Announce Type: cross We propose Latent Terms, a method revealing that models trained for dense retrieval, whether single- or multi-vector, learn representations that can be trivially decomposed into retrieval-ready sparse features. When trained on frozen retrievers, Sparse Autoencoders extract, without any retrieval-specific adjustments, a latent vocabulary with approximately Zipfian collection statistics that is directly suitable for classical sparse retrieval scoring via BM25.
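
To make the idea of BM25 over a latent vocabulary concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each embedding has already been passed through a sparse autoencoder, and that the indices of its strongest positive activations are treated as "terms". The helper names (`top_k_latents`, `build_index`, `bm25_scores`), the top-k selection rule, the smoothed IDF variant, and the defaults `k1=1.2`, `b=0.75` are illustrative assumptions; the abstract does not specify any of them.

```python
import math
from collections import Counter

def top_k_latents(activations, k):
    """Hypothetical tokenization step: keep the ids of the k largest
    positive SAE activations and treat them as latent 'terms'."""
    ranked = sorted(range(len(activations)),
                    key=lambda i: activations[i], reverse=True)
    return [i for i in ranked[:k] if activations[i] > 0]

def build_index(docs):
    """docs: one list of latent ids per document. For a multi-vector
    retriever this could be the concatenation of the latent ids of all
    token vectors, so repeated ids act as term frequencies (assumption)."""
    tfs = [Counter(d) for d in docs]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())  # document frequency of each latent id
    avgdl = sum(len(d) for d in docs) / len(docs)
    return tfs, df, avgdl

def bm25_scores(query, tfs, df, avgdl, k1=1.2, b=0.75):
    """Standard BM25 with a smoothed, always-positive IDF,
    applied to latent ids instead of words."""
    n = len(tfs)
    scores = []
    for tf in tfs:
        dl = sum(tf.values())  # document length in latent terms
        s = 0.0
        for t in set(query):
            f = tf.get(t, 0)
            if f == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))
        scores.append(s)
    return scores

# Toy usage: three documents already mapped to latent ids.
docs = [[1, 2, 2], [3, 4], [1, 5]]
tfs, df, avgdl = build_index(docs)
query = top_k_latents([0.0, 0.1, 2.0, -1.0], k=1)  # keeps latent id 2
scores = bm25_scores(query, tfs, df, avgdl)
```

In this toy setup only the first document contains latent id 2, so it is the only one with a non-zero score. Because scoring depends only on integer ids and counts, an index of this kind could in principle be served by any inverted-index engine that implements BM25.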