Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration
arXiv CS.LG
arXiv:2605.28184v1

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as the standard paradigm for improving the reasoning capabilities of large language models, while Multi-Token Prediction (MTP) has been a widely adopted module in pre