AI RESEARCH

Behavioural Analysis of Alignment Faking

arXiv cs.AI

arXiv:2605.27681v1 Announce Type: new

Alignment faking (AF) refers to a model strategically complying with a