AI RESEARCH

Reading Task Failure Off the Activations: A Sparse-Feature Audit of GPT-2 Small on Indirect Object Identification

arXiv CS.LG

arXiv:2605.22719v1 Announce Type: new We report a small, reproducible audit of which sparse-autoencoder (SAE) features of GPT-2 small fire differently on failed versus successful trials of the Indirect Object Identification (IOI) task. On 300 prompts, GPT-2 small reaches 79.7% accuracy; 146 of the 24,576 features in the layer-8 residual-stream SAE release of Bloom clear a Holm-corrected significance threshold, and 105 reach a large effect size (|Cohen's d| > 0.8
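
The two statistics named in the abstract, Cohen's d and a Holm-corrected threshold, can be illustrated with a minimal sketch. This is not the authors' code: the excerpt does not say which per-feature test produced the p-values, which significance level was used, or how the variance in d was pooled, so the pooled-standard-deviation form of d, the default `alpha=0.05`, the function names `cohens_d` and `holm_reject`, and all the numbers below are assumptions chosen for illustration only.

```python
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d between two samples, using the pooled standard deviation (assumed form)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: returns one reject/keep flag per p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared to alpha / (m - k + 1).
        if p_values[idx] > alpha / (m - rank):
            break  # first failure stops the procedure; all larger p-values are kept
        reject[idx] = True
    return reject


# Toy activations for one hypothetical feature (not data from the paper).
failed = [2.0, 3.0, 4.0]
succeeded = [0.0, 1.0, 2.0]
d = cohens_d(failed, succeeded)

# Toy p-values for four hypothetical features.
flags = holm_reject([0.001, 0.04, 0.03, 0.005])

print(d, abs(d) > 0.8)
print(flags)
```

In the setting described above, `p_values` would hold one entry per SAE feature (24,576 in total), so under this procedure the smallest p-value would have to fall below `alpha / 24576` before any feature is flagged. The effect-size filter is applied separately from the significance filter, which is consistent with the abstract reporting the two counts (146 and 105) side by side.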