AI RESEARCH

Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth

arXiv CS.CL

arXiv:2605.25052v1 Announce Type: new Chains of thought (CoTs) have become central to interpreting and auditing the behavior of large language models. Yet growing evidence suggests that these traces often fail to faithfully represent the computations behind a model's predictions. Several faithfulness metrics have been proposed, but whether they actually measure faithfulness remains unknown. Answering this question requires ground-truth labels, which are hard to obtain because a model's internal computations are not directly observable.