AI RESEARCH
The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes
arXiv CS.AI
•
Training against white-box deception detectors has been proposed as a way to make AI systems honest. However, such training risks models learning to obfuscate their deception to evade the detector. We construct a realistic coding environment where reward hacking via hardcoding test cases naturally occurs, and show that obfuscation emerges in this setting.