CAST: Non-Privileged Clipped Asymmetric Self-Teaching with Advantage Flipping for GRPO

arXiv:2606.00172v1 Announce Type: new

Reinforcement learning with verifiable rewards (RLVR), especially Group Relative Policy Optimization (GRPO), is widely used to improve reasoning in large language models. However, outcome-level rewards provide only sparse supervision, and group-relative advantages vanish when the sampled trajectories for a prompt are all correct or all incorrect.
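The vanishing-advantage problem can be illustrated with a small sketch. It assumes the commonly used GRPO normalization, in which each trajectory's advantage is its reward minus the group mean, divided by the group standard deviation. The function name `group_relative_advantages`, the `eps` stabilizer, and the binary 0/1 rewards are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group (assumed GRPO-style form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps (an assumed stabilizer) avoids division by zero for uniform groups
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes: correct trajectories get positive advantages, incorrect negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Uniform outcomes: every reward equals the group mean, so every advantage is zero
# and the prompt contributes no policy-gradient signal.
all_correct = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Under this assumed normalization, a prompt that the policy always solves or always fails yields no learning signal, which is the gap the abstract points to.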