AI RESEARCH
Extreme Region Policy Distillation
arXiv CS.AI
arXiv:2605.25582v1 Announce Type: cross Reinforcement learning for large language models faces a fundamental trade-off between sample efficiency and asymptotic performance: strictly on-policy methods discard trajectories after a single update, while off-policy reuse