AI RESEARCH

An Effective-Rank Audit of Alignment-Induced Activation Shifts: Confound Control, Constructive Calibration, and Limits

arXiv cs.LG

arXiv:2605.24583v1 Announce Type: new

We audit alignment-induced shifts in the residual-stream activations of three open-weight instruction-tuned LLMs (Llama-3.1-8B-Instruct, Gemma-2-9B-it, Qwen-2.5-7B-Instruct). The audit uses the effective rank of the alignment modification matrix on safety-relevant inputs, rho_eps := rank_eps(M_Ds)/d, which formalizes the single-refusal-direction observation of Arditi et al. as a continuous quantity. The paper has three contributions.
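To make the quantity rho_eps = rank_eps(M_Ds)/d concrete, here is a minimal sketch in Python. The abstract does not state how rank_eps is defined or how M_Ds is built, so two things are assumptions made only for illustration: rank_eps is taken to be the number of singular values exceeding eps times the largest singular value, and M_Ds is stood in for by a synthetic (n, d) matrix whose rows play the role of per-input activation shifts. The function name effective_rank_ratio is hypothetical.

```python
import numpy as np


def effective_rank_ratio(M: np.ndarray, eps: float = 1e-2) -> float:
    """Return rank_eps(M) / d for an (n, d) matrix M.

    ASSUMPTION: rank_eps counts singular values larger than eps times the
    largest singular value. The paper may use a different threshold rule.
    """
    d = M.shape[1]
    s = np.linalg.svd(M, compute_uv=False)  # singular values, descending
    if s.size == 0 or s[0] == 0.0:
        return 0.0
    rank_eps = int(np.sum(s > eps * s[0]))
    return rank_eps / d


# Synthetic stand-ins for M_Ds (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
n, d = 200, 64

# Case 1: every row is a multiple of one direction, plus tiny noise.
# This mimics a shift concentrated along a single direction.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
coeffs = rng.normal(size=(n, 1))
M_single = coeffs * direction + 1e-6 * rng.normal(size=(n, d))

# Case 2: a dense Gaussian matrix, where the shift spreads over all d axes.
M_dense = rng.normal(size=(n, d))

rho_single = effective_rank_ratio(M_single, eps=1e-2)
rho_dense = effective_rank_ratio(M_dense, eps=1e-2)
print(rho_single, rho_dense)
```

Under this assumed definition, a shift confined to one direction gives rho_eps = 1/d, while a shift spread across all dimensions gives rho_eps close to 1, which is the sense in which the ratio turns the single-direction observation into a continuous quantity.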