AI RESEARCH

Realistic honeypot evaluations for scheming propensity

arXiv cs.LG

We introduce scheming honeypot evaluations, a framework for testing whether models pursue instrumental goals when given the opportunity. In a real internal deployment setting, Gemini models do not demonstrate unprompted scheming. When prompts explicitly encourage agency (situational awareness or goal-directedness), give the model a hidden goal, or both, models sometimes scheme or attempt sabot