Jailbreak Attack Initializations as Extractors of Compliance Directions

arXiv:2502.09755v4

Safety-aligned LLMs respond to prompts with either compliance or refusal, each corresponding to distinct directions in the model's activation space. Recent work shows that initializing attacks via self-transfer from other prompts significantly enhances their performance. However, the mechanisms underlying these initializations remain unclear, and attacks use arbitrary or hand-picked initializations.