Debate with Images: Detecting Deceptive Behaviors in Multimodal Large Language Models

arXiv:2512.00349v2 Announce Type: replace

Are frontier AI systems becoming more capable? Certainly. Yet such progress is not an unalloyed blessing but rather a Trojan horse: behind their performance leaps lie insidious and destructive safety risks, namely deception. Unlike hallucination, which arises from insufficient capability and leads to mistakes, deception represents a deeper threat in which models deliberately mislead users through complex reasoning and insincere responses.