AI RESEARCH
Behavioural Analysis of Alignment Faking
arXiv cs.AI
arXiv:2605.27681v1 Announce Type: new
Alignment faking (AF) refers to a model strategically complying with a