F-GRPO: Don't Let Your Policy Learn the Obvious and Forget the Rare

arXiv:2602.06717v2 Announce Type: replace-cross Reinforcement Learning with Verifiable Rewards (RLVR) is commonly based on group sampling to estimate advantages and stabilize policy updates. In practice, computational limits often rule out very large groups, so