AI RESEARCH
REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak
arXiv CS.LG
arXiv:2605.20654v1 Announce Type: new While Large Language Models (LLMs) demonstrate remarkable capabilities, they remain susceptible to sophisticated, multi-step jailbreak attacks that circumvent conventional surface-level safety alignment by exploiting the internal generation process. To address these vulnerabilities, we propose Reflector, a principled two-stage framework that internalizes self-reflection within the generation trajectory.