Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization

arXiv:2510.05342v2 Announce Type: replace-cross Direct Preference Optimization (DPO) has emerged as a simple and effective method for aligning large language models. However, its reliance on a fixed temperature parameter leads to suboptimal