Learning What to Recommend: Minimax Optimal Simple Regret in Logistic Bandits

arXiv:2601.21167v2 Announce Type: replace We study stochastic logistic bandits with $d$-dimensional action features under the simple-regret objective, where a learner uses $T$ rounds of exploration to output a single final action. The logistic structure is essential here: because the informativeness of an action depends on the local curvature of the sigmoid, actions that are best for immediate reward need not be the most useful for identifying the best final recommendation.
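
To make the curvature point concrete, the following is a minimal illustrative sketch, not the paper's algorithm. It assumes the standard logistic model in which the expected reward of an action $x$ is $\sigma(x^\top \theta)$ and the information carried by one observation scales with the sigmoid derivative $\sigma(z)(1 - \sigma(z))$ at $z = x^\top \theta$. The parameter `theta` and both action vectors are hypothetical values chosen only for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic link: maps a linear score to a reward probability."""
    return 1.0 / (1.0 + math.exp(-z))


def curvature(z: float) -> float:
    """Sigmoid derivative sigma(z) * (1 - sigma(z)), used here as a
    scalar proxy for how informative one observation at score z is."""
    s = sigmoid(z)
    return s * (1.0 - s)


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical unknown parameter and two hypothetical actions (d = 2).
theta = [2.0, 0.0]
actions = {
    "high_reward": [3.0, 0.0],    # score z = 6.0, deep in the flat tail
    "near_boundary": [0.1, 1.0],  # score z = 0.2, close to the steep region
}

# For each action, record (expected reward, curvature at its score).
summary = {}
for name, x in actions.items():
    z = dot(x, theta)
    summary[name] = (sigmoid(z), curvature(z))

for name, (reward, info) in summary.items():
    print(f"{name}: expected reward = {reward:.4f}, curvature = {info:.4f}")
```

Under these assumed values, the action with the higher expected reward sits where the sigmoid is nearly flat, so each pull of it reveals comparatively little, while the lower-reward action sits near the steepest part of the sigmoid. This is only a scalar illustration; the full picture also depends on the direction of each feature vector.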