Adaptive Exploration for Latent-State Bandits

arXiv:2602.05139v3 Announce Type: replace

We study bandits whose rewards depend on an unobserved Markov state that evolves independently of the learner's actions. The optimal arm can therefore change over time, even though the learner observes only its past actions and rewards. We propose algorithms that feed LinUCB with two summaries of the hidden state: a lagged action-reward pair and, when available, a probe fingerprint formed from the rewards of multiple arms. The adaptive variants refresh the fingerprint using residual, margin, and staleness tests.
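
To make the construction concrete, the following is a minimal Python sketch of the general idea, not the paper's implementation. The abstract does not specify the feature encoding, the exact form of the three tests, or how they are combined, so everything here is an assumption made for illustration: the names `build_context`, `should_refresh`, and `simulate`, the thresholds `resid_tol`, `margin_tol`, and `max_stale`, the OR-combination of the tests, the two-state toy environment, and the simplification that probe pulls are free.

```python
import numpy as np


class LinUCB:
    """Disjoint LinUCB: one ridge-regression model per arm."""

    def __init__(self, n_arms, dim, alpha=1.0, reg=1.0):
        self.alpha = alpha
        self.A = np.stack([reg * np.eye(dim) for _ in range(n_arms)])  # (K, d, d)
        self.b = np.zeros((n_arms, dim))                               # (K, d)

    def scores(self, x):
        """Return (mean estimates, upper confidence bounds) for context x."""
        A_inv = np.linalg.inv(self.A)                    # batched inverse
        theta = np.einsum("kij,kj->ki", A_inv, self.b)   # per-arm coefficients
        mean = theta @ x
        width = np.sqrt(np.einsum("i,kij,j->k", x, A_inv, x))
        return mean, mean + self.alpha * width

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x


def build_context(n_arms, last_arm, last_reward, fingerprint):
    """Hypothetical encoding: bias, one-hot lagged arm, lagged reward, fingerprint."""
    one_hot = np.zeros(n_arms)
    one_hot[last_arm] = 1.0
    return np.concatenate(([1.0], one_hot, one_hot * last_reward, fingerprint))


def should_refresh(residual, margin, staleness,
                   resid_tol=0.5, margin_tol=0.1, max_stale=20):
    """Assumed rule: refresh if any of the three tests fires."""
    return (abs(residual) > resid_tol      # prediction was badly off
            or margin < margin_tol         # top two arms are nearly tied
            or staleness >= max_stale)     # fingerprint is old


def simulate(T=500, seed=0):
    """Toy two-state, two-arm environment; probe pulls are treated as free."""
    rng = np.random.default_rng(seed)
    means = np.array([[0.9, 0.1], [0.1, 0.9]])  # means[state, arm]
    K = means.shape[1]
    bandit = LinUCB(K, dim=1 + 3 * K)
    state, staleness, total = 0, 0, 0.0
    fingerprint, last_arm, last_reward = np.zeros(K), 0, 0.0

    for _ in range(T):
        if rng.random() < 0.05:            # hidden state ignores the learner's actions
            state = 1 - state
        x = build_context(K, last_arm, last_reward, fingerprint)
        mean, ucb = bandit.scores(x)
        arm = int(np.argmax(ucb))
        reward = float(rng.random() < means[state, arm])
        bandit.update(arm, x, reward)
        total += reward

        staleness += 1
        top_two = np.sort(mean)[-2:]
        if should_refresh(reward - mean[arm], top_two[1] - top_two[0], staleness):
            # Probe: sample every arm once to form a new fingerprint.
            fingerprint = (rng.random(K) < means[state]).astype(float)
            staleness = 0
        last_arm, last_reward = arm, reward
    return total
```

Calling `simulate()` returns the cumulative reward over `T` rounds for this toy setup. A more faithful experiment would charge the probe pulls against the learner's budget and compare against a LinUCB baseline that sees only the lagged action-reward pair.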