Insider Attacks on Multi-Agent LLM Consensus: arXiv 2605.08268

About This Tutorial

The headline finding in one paragraph

The paper, posted on arXiv on 8 May 2026 as 2605.08268, studies what happens when an adversary controls a minority of agents inside a multi-agent LLM consensus system. Instead of brute-forcing prompt-injection payloads, the authors train a reinforcement-learning attacker on top of a learned world-model: a surrogate that predicts how benign agents' behavioural states evolve given the messages they see. The attacker then chooses messages that nudge the surrogate's predicted votes in the adversary's favour.
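
To make the attack loop concrete, here is a minimal toy sketch in Python. It is not the paper's implementation. The scalar "leaning" state, the linear surrogate, the `susceptibility` parameter, and all function names are assumptions invented for illustration. The paper trains a reinforcement-learning attacker; this sketch swaps in a greedy one-step search over candidate messages to keep the idea visible in a few lines.

```python
from typing import List, Sequence

# Hypothetical toy state: each benign agent has a scalar "leaning" in [-1, 1].
# A positive leaning means the surrogate predicts a vote for the adversary's option.

def surrogate_step(states: Sequence[float], message: float,
                   susceptibility: float = 0.3) -> List[float]:
    """Assumed surrogate world-model: each agent moves a fixed fraction
    of the way from its current leaning toward the message it sees."""
    return [s + susceptibility * (message - s) for s in states]

def predicted_votes(states: Sequence[float]) -> int:
    """Count the agents predicted to vote for the adversary's option."""
    return sum(1 for s in states if s > 0)

def choose_message(states: Sequence[float], candidates: Sequence[float]) -> float:
    """Greedy one-step lookahead (a stand-in for the RL policy): pick the
    candidate message whose predicted next states yield the most favourable votes."""
    return max(candidates, key=lambda m: predicted_votes(surrogate_step(states, m)))

benign_states = [-0.2, -0.5, 0.1]   # made-up starting leanings
candidates = [-1.0, 0.0, 1.0]       # made-up message "directions"
best = choose_message(benign_states, candidates)
```

In this toy setup the attacker never queries the benign agents directly while planning; it only rolls candidate messages through the surrogate and keeps the one with the best predicted outcome. A learned policy would replace the exhaustive `max` over candidates and could plan over several rounds rather than one.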