AI RESEARCH

Data- and Variance-dependent Regret Bounds for Online Tabular MDPs

arXiv cs.LG

arXiv:2602.01903v2 Announce Type: replace This work studies online episodic tabular Markov decision processes (MDPs) with known transitions and develops best-of-both-worlds algorithms that achieve refined data-dependent regret bounds in the adversarial regime and variance-dependent regret bounds in the stochastic regime. We quantify MDP complexity using a first-order quantity and several new data-dependent measures for the adversarial regime, including a second-order quantity and a path-length measure, as well as variance-based measures for the stochastic regime.
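
To give a rough feel for what first-order, second-order, and path-length quantities measure, the sketch below computes their standard analogues from the online learning literature for a plain sequence of loss vectors over a fixed action set. This is an illustration under stated assumptions, not the paper's definitions: the abstract does not spell out the MDP versions, and the function name `data_dependent_measures` and the (rounds x actions) loss layout are hypothetical.

```python
import numpy as np


def data_dependent_measures(losses):
    """Compute textbook data-dependent measures for a loss sequence.

    losses: array-like of shape (T, K), the loss of each of K actions in
    each of T rounds. These are the usual bandit/expert-setting analogues,
    assumed here for illustration; the paper's MDP quantities may differ.
    """
    losses = np.asarray(losses, dtype=float)

    # First-order quantity: cumulative loss of the best fixed action.
    first_order = losses.sum(axis=0).min()

    # Second-order quantity: total squared deviation of the loss vectors
    # from their empirical mean (the minimizer of the squared distance).
    mean_loss = losses.mean(axis=0)
    second_order = ((losses - mean_loss) ** 2).sum()

    # Path length: total L1 change between consecutive loss vectors.
    path_length = np.abs(np.diff(losses, axis=0)).sum()

    return {
        "first_order": float(first_order),
        "second_order": float(second_order),
        "path_length": float(path_length),
    }
```

All three measures are small on benign sequences: a constant loss sequence has zero second-order quantity and zero path length, which is the kind of situation where data-dependent bounds can improve on worst-case bounds.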