Towards Verifiable Transformers: Solver-Checkable Circuit Explanations

arXiv:2605.24033v1 Announce Type: new Mechanistic interpretability often identifies circuits inside Transformer models, but explanations of those circuits are usually validated through examples, ablations, and manual reasoning. This leaves a gap between finding a plausible circuit and proving what the circuit does. We