AI RESEARCH

Bypassing Prompt Guards in Production with Controlled-Release Prompting

arXiv CS.LG

arXiv:2510.01529v3. Ball recently established that prompt filtering for AI alignment faces a fundamental barrier: under standard cryptographic assumptions, no filter that runs significantly faster than the model it protects can universally distinguish adversarial prompts from benign ones. We investigate whether this impossibility result translates into real-world vulnerabilities in deployed large language model (LLM) systems. We answer affirmatively by