Training Deliberative Monitors for Black-Box Scheming Detection

arXiv:2605.29601v1 Announce Type: cross As autonomous agents become capable of performing real-world tasks, distinguishing scheming behavior from benign task pursuit may become a central AI control problem. Existing monitors often rely on chain-of-thought access or internal activations, or use prompted frontier models, all of which can be unavailable, unreliable, or expensive in deployment.