Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs

arXiv:2606.03785v1 Announce Type: new Backdoor attacks on Large Language Models (LLMs) are a growing security concern: a backdoored model can be made to generate adversary-chosen content. Existing defenses target backdoors one at a time and typically require knowledge of the trigger, leaving the defender at a structural disadvantage when unknown backdoors may exist in a model. We show that backdoor neutralization through unlearning generalizes across backdoors