MechRL: Reinforcement Learning Agents Perform Circuit Discovery for Mechanistic Interpretability

arXiv:2605.26343v1 Announce Type: new Mechanistic interpretability has identified small sets of attention heads that implement specific behaviours in transformer language models, but recovering these circuits typically requires a bespoke analytical pipeline for each new task. We recast circuit discovery as a reinforcement-learning problem.