CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders

arXiv:2604.01604v2 Announce Type: replace While modern LLMs are aligned to refuse harmful requests, understanding the mechanistic basis of this refusal behavior is essential for model safety analysis. For example, steering-based jailbreak attacks exploit this mechanism by identifying and manipulating sparse, neuron-like refusal features to bypass safety guardrails. Current feature selection methods primarily rely on how strongly features activate on harmful prompts.