AI RESEARCH
Realistic honeypot evaluations for scheming propensity
arXiv cs.LG
We introduce scheming honeypot evaluations, a framework for testing whether models will pursue instrumental goals if given the opportunity. In a real internal deployment setting, Gemini models do not demonstrate unprompted scheming. If prompts explicitly encourage agency (situational awareness or goal-directedness) and/or give the model a hidden goal, models sometimes scheme or attempt sabot