AI RESEARCH
Learning Self-Interpretation from Interpretability Artifacts: Training Lightweight Adapters on Vector-Label Pairs
arXiv cs.LG
arXiv:2602.10352v2. Self-interpretation methods prompt language models to describe their own internal states, but they remain unreliable because of hyperparameter sensitivity. We show that