AI RESEARCH

The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes

arXiv CS.AI

Training against white-box deception detectors has been proposed as a way to make AI systems honest. However, such training risks models learning to obfuscate their deception to evade the detector. We construct a realistic coding environment where reward hacking via hardcoding test cases naturally occurs, and show that obfuscation emerges in this setting.