Detecting Unfaithful Chain-of-Thought via Circuit-Guided Internal-External Discrepancy

arXiv:2605.25603v1 Announce Type: new

Chain-of-thought (CoT) reasoning improves the problem-solving ability of large language models (LLMs), but the generated reasoning traces may not faithfully reflect the model's actual decision process. Existing CoT unfaithfulness detectors rely mainly on external signals from the generated rationales, such as textual plausibility or answer consistency, and overlook evidence from the model's internal computation.