Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

ArXi:2605.20834v1 Announce Type: cross Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF), offering theoretical equivalence with simpler implementation. We prove this equivalence is conditional rather than universal, depending on an implicit assumption frequently violated in practice: the RLHF-optimal policy must prefer human-preferred responses.