Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations

arXiv:2605.28553v1 Announce Type: new We investigate whether refusal behavior can be predicted from an LLM's intermediate activations before decoding, using linear probes trained on the residual-stream activations at each transformer block. We find that refusal is linearly decodable well before the final layer, indicating that safety-relevant behavior is already represented in intermediate activations before any output is generated. To test whether this signal is actionable, we