MARS: Margin and Semantic-Aware Data Augmentation for Reward Modeling

ArXi:2602.17658v2 Announce Type: replace-cross Reward modeling is central to alignment pipelines such as RLHF, RLAIF, and PPO-based policy optimization, yet its reliability is constrained by limited and heterogeneous human preference data that are expensive to collect at scale. While synthetic augmentation can expand preference supervision, existing methods often augment uniformly or at the representation level, without targeting examples where the reward model is uncertain or prone to mis-ranking. In this paper, we.