Linear and Neural Dueling Bandits with Delayed Feedback

arXiv:2605.26554v1 Announce Type: cross Contextual dueling bandits form a cornerstone of preference-based decision-making, with critical applications in recommender systems and large language model alignment. However, standard algorithms rely on the idealized assumption of immediate feedback, a condition frequently violated in real-world scenarios such as prompt optimization. This setting solutions, rendering naive adaptations of standard weighting techniques biased.