GREAT: Generalizable Backdoor Attacks in RLHF via Emotion-Aware Trigger Synthesis

arXiv:2510.09260v2 Announce Type: replace-cross Recent work has shown that RLHF is highly susceptible to backdoor attacks. However, existing methods often rely on rare tokens or fixed triggers, which limits their impact in realistic scenarios. In this work, we develop GREAT, a novel framework for crafting natural distributional backdoors in RLHF. Specifically, GREAT targets harmful response generation for a vulnerable user subpopulation characterized by semantically violent requests paired with emotionally angry triggers.