AI RESEARCH

Building Better Activation Oracles

arXiv CS.LG

arXiv:2606.02609v1 Announce Type: new

Activation Oracles (AOs) are promising methods for interpreting residual stream activations. However, current AOs face important issues, such as hallucinations and vagueness. Additionally, text-inversion confounds make them hard to evaluate. To this end, we improve the Activation Oracle (AO)