AI RESEARCH
UCPO: Uncertainty-Aware Policy Optimization
arXiv CS.AI
arXiv:2601.22648v2 Announce Type: replace The key to building trustworthy large language models (LLMs) lies in endowing them with inherent uncertainty expression capabilities, thereby mitigating overconfident errors in high-stakes applications. However, existing RL paradigms such as GRPO often suffer from Advantage Bias due to binary decision spaces and static uncertainty rewards, inducing either excessive conservatism or overconfidence.
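As a rough illustration of how a static uncertainty reward can interact with group-relative advantages, the sketch below applies the standard GRPO-style normalization (reward minus group mean, divided by group standard deviation) to a group of sampled responses. It is not the paper's definition of Advantage Bias and not the UCPO method. The reward values (1.0 for correct, 0.0 for wrong, a fixed 0.5 for an "uncertain" response) and the helper names `grpo_advantages` and `abstain_advantage` are assumptions made for this example only.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def abstain_advantage(n_correct, n_wrong, n_abstain, r_abstain):
    """Advantage assigned to an 'uncertain' response within one sampled group.

    Assumed reward scheme (hypothetical): correct = 1.0, wrong = 0.0,
    uncertain = r_abstain, a fixed constant that ignores prompt difficulty.
    """
    if n_abstain < 1:
        raise ValueError("the group needs at least one uncertain response")
    rewards = [1.0] * n_correct + [0.0] * n_wrong + [r_abstain] * n_abstain
    # Uncertain responses sit at the end of the list and share one reward.
    return grpo_advantages(rewards)[-1]


# Same static reward of 0.5, two groups of different difficulty.
easy = abstain_advantage(n_correct=6, n_wrong=1, n_abstain=1, r_abstain=0.5)
hard = abstain_advantage(n_correct=1, n_wrong=6, n_abstain=1, r_abstain=0.5)

print(f"uncertain-response advantage, mostly-correct group: {easy:+.3f}")
print(f"uncertain-response advantage, mostly-wrong group:   {hard:+.3f}")
```

Under these assumptions, the sign of the advantage for an uncertain response depends only on whether the group's mean reward is above or below the fixed constant: it is negative in the mostly-correct group and positive in the mostly-wrong group. A constant set too high would therefore push the policy toward expressing uncertainty on most prompts, and one set too low would push it away from doing so, which is one way to read the conservatism and overconfidence outcomes the abstract describes.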