AI RESEARCH
Value-Free Policy Optimization via Reward Partitioning
arXiv CS.AI
arXiv:2506.13702v4 Announce Type: replace-cross Single-trajectory preference optimization methods learn from datasets of (prompt, response, reward) tuples, offering a practical alternative to pairwise preference learning by directly leveraging scalar feedback. Existing approaches such as Direct Reward Optimization (DRO) have demonstrated promising results but rely on value function estimation,