Extreme Region Policy Distillation

arXiv:2605.25582v1 Announce Type: cross Reinforcement learning for large language models faces a fundamental trade-off between sample efficiency and asymptotic performance: strictly on-policy methods discard trajectories after a single update, while off-policy reuse