TIAR: Trajectory-Informed Advantage Reweighting for LLM Abstention Learning

arXiv:2605.25850v1 Announce Type: cross This paper investigates abstention learning for large language models (LLMs), building on the ternary reward, which incentivizes truthfulness in LLMs. It extends that idea by moving from a ternary reward to Trajectory-Informed Advantage Reweighting (TIAR), which dynamically reweights the abstention reward during Group Relative Policy Optimization.
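
To make the setup concrete, the following is a minimal sketch of a ternary reward combined with group-relative advantage normalization in the style of Group Relative Policy Optimization. It is not the paper's implementation: the abstract does not specify how TIAR derives its weights from trajectories, so the `abstain_reward` parameter is a hypothetical stand-in for whatever value the reweighting would produce. The reward values (+1, 0, -1), the use of the population standard deviation, and all function names are assumptions made for illustration.

```python
from statistics import mean, pstdev


def ternary_reward(outcome: str, abstain_reward: float = 0.0) -> float:
    """Map a response outcome to a ternary reward.

    Assumed values: +1 for a correct answer, -1 for a wrong answer, and
    `abstain_reward` for an abstention (0 by default). `abstain_reward`
    is a hypothetical knob standing in for a dynamically reweighted value.
    """
    if outcome == "correct":
        return 1.0
    if outcome == "abstain":
        return abstain_reward
    if outcome == "wrong":
        return -1.0
    raise ValueError(f"unknown outcome: {outcome!r}")


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps), which is
    the group-relative normalization commonly associated with this family
    of methods. Population std is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of four sampled responses to the same prompt.
outcomes = ["correct", "abstain", "wrong", "correct"]

baseline_rewards = [ternary_reward(o) for o in outcomes]
baseline_adv = group_relative_advantages(baseline_rewards)

# Raising the abstention reward (hypothetical value) shifts the
# advantage assigned to the abstaining response.
reweighted_rewards = [ternary_reward(o, abstain_reward=0.5) for o in outcomes]
reweighted_adv = group_relative_advantages(reweighted_rewards)
```

In this sketch, changing the abstention reward alters the advantage of every response in the group, because the group mean and standard deviation both move. That coupling is one reason a method might prefer to adjust the abstention signal dynamically rather than fix it in advance.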