AI RESEARCH
Prudent-Banker: No Extra Fees for Baseline Safety in Adversarial Bandits With and Without Delays
arXiv cs.LG
arXiv:2605.23351v1 Announce Type: new
We study adversarial multi-armed bandits with and without delayed feedback under a safety-aware goal: achieving minimax-optimal worst-case regret while keeping nearly constant regret relative to a designated "safe" baseline policy. Existing approaches can balance this trade-off with immediate feedback for smooth comparators, but arbitrary delays can mistime transitions between conservatism and exploration, endangering the safety guarantee.
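To make the two quantities in the abstract concrete, here is a minimal sketch that measures both regret against the best fixed arm and regret against a designated baseline arm under delayed feedback. It is not the paper's Prudent-Banker algorithm, which the abstract does not specify: it uses standard Exp3 with importance-weighted updates applied when feedback arrives. The function name `exp3_delayed`, its parameters, and the choice of a single fixed arm as the "safe" baseline are all assumptions made for illustration.

```python
import math
import random


def exp3_delayed(losses, delays, eta, baseline_arm, seed=0):
    """Run standard Exp3 with delayed bandit feedback on a fixed loss table.

    losses[t][i] is the loss in [0, 1] of arm i at round t. The loss observed
    at round t only reaches the learner at the end of round t + delays[t];
    feedback that would arrive after the last round is never used.
    Returns (regret vs. best fixed arm, regret vs. baseline_arm), both realized.
    """
    rng = random.Random(seed)
    T, K = len(losses), len(losses[0])
    cum_est = [0.0] * K   # importance-weighted cumulative loss estimates
    pending = {}          # arrival round -> list of (arm, prob, loss)
    learner_loss = 0.0

    for t in range(T):
        # Exponential weights over the estimates received so far
        # (shifted by the minimum for numerical stability).
        m = min(cum_est)
        weights = [math.exp(-eta * (c - m)) for c in cum_est]
        total = sum(weights)
        probs = [w / total for w in weights]

        arm = rng.choices(range(K), weights=probs)[0]
        loss = losses[t][arm]
        learner_loss += loss
        pending.setdefault(t + delays[t], []).append((arm, probs[arm], loss))

        # Apply every feedback item that arrives at the end of this round.
        for a, p, l in pending.pop(t, []):
            cum_est[a] += l / p

    arm_totals = [sum(losses[t][i] for t in range(T)) for i in range(K)]
    worst_case_regret = learner_loss - min(arm_totals)
    baseline_regret = learner_loss - arm_totals[baseline_arm]
    return worst_case_regret, baseline_regret


if __name__ == "__main__":
    T = 2000
    table = [[0.5, 0.2, 0.8] for _ in range(T)]  # arm 0 plays the "safe" baseline
    print(exp3_delayed(table, [10] * T, eta=0.02, baseline_arm=0))
```

This plain variant makes no attempt to stay close to the baseline while it explores, so it only illustrates how the two regrets are measured, not how to control them jointly.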