AI RESEARCH

Learning Self-Interpretation from Interpretability Artifacts: Training Lightweight Adapters on Vector-Label Pairs

arXiv CS.LG

arXiv:2602.10352v2 Announce Type: replace-cross Self-interpretation methods prompt language models to describe their own internal states, but remain unreliable due to hyperparameter sensitivity. We show that