Interpretable Discriminative Text Representations via Agreement and Label Disentanglement

arXiv:2605.20693v1

Interpretable text representations should expose coordinates that are not only predictive, but also meaningful enough for independent auditors to apply. Existing discriminative representations often use anonymous embedding directions, while concept-bottleneck and LLM-assisted methods attach natural-language names to features without ensuring that those definitions are reproducible or distinct from the target label.