Reflect-Guard: Enhancing LLM Safeguards against Adversarial Prompts via Logical Self-Reflection

arXiv:2605.24834v1 Announce Type: cross Large language model (LLM) safety classifiers such as Llama Guard are effective at detecting overtly harmful prompts but remain vulnerable to adversarial jailbreak attacks that disguise malicious intent through role-play scenarios, fictional framing, and indirect requests. We present Reflect-Guard, a method that augments LLM-based safety classifiers with chain-of-thought self-reflection capabilities through parameter-efficient fine-tuning.