CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning

arXiv:2605.28742v1

Language models can use verifiable rewards to improve on a wide variety of reasoning tasks. However, both parametric (e.g., RLVR) and non-parametric (e.g., prompt optimization) approaches to doing so typically require hundreds of