AI RESEARCH
Beyond Two-Stage Training: Cooperative SFT and RL for LLM Reasoning
arXiv cs.CL
arXiv:2509.06948v3 Announce Type: replace Supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) are two widely used post-