Information-Directed Offline-to-Online Reinforcement Learning

arXiv:2605.29405v1 Announce Type: new Decision-making from offline datasets typically warm-starts a policy or score model from fixed offline data and then refines it with limited online interaction. Offline data reduces uncertainty, but it does not remove the need for exploration; it changes what remains to be explored. We formalise this residual uncertainty as the conditional mutual information $I(\chi;\tau_{1:T}\mid\mathcal{D}_N)$ between a learning target $\chi$ and the online trajectories $\tau_{1:T}$, conditioned on the offline dataset $\mathcal{D}_N$.
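
To make the quantity $I(\chi;\tau_{1:T}\mid\mathcal{D}_N)$ concrete, the following is a minimal sketch on a toy problem, not the paper's method. It assumes a single Bernoulli arm whose unknown mean plays the role of $\chi$ and takes one of three hypothetical values, an offline dataset summarised by success and failure counts, and $T$ non-adaptive online pulls as the trajectory. The function names `posterior` and `residual_information` are illustrative only.

```python
import math


def posterior(prior, thetas, successes, failures):
    """Belief over the candidate means after conditioning on offline counts."""
    weights = [p * th**successes * (1 - th)**failures
               for p, th in zip(prior, thetas)]
    total = sum(weights)
    return [w / total for w in weights]


def residual_information(belief, thetas, horizon):
    """Mutual information (in nats) between the unknown mean and `horizon`
    future Bernoulli pulls, under the given belief.

    The number of successes k is a sufficient statistic for the pulls, so
    summing over k gives the same value as summing over full sequences.
    """
    info = 0.0
    for k in range(horizon + 1):
        # Likelihood of observing k successes under each candidate mean.
        lik = [math.comb(horizon, k) * th**k * (1 - th)**(horizon - k)
               for th in thetas]
        marginal = sum(b * l for b, l in zip(belief, lik))
        for b, l in zip(belief, lik):
            if b > 0 and l > 0:
                info += b * l * math.log(l / marginal)
    return info


# Hypothetical toy setup (assumption): three candidate means, uniform prior.
thetas = [0.2, 0.5, 0.8]
prior = [1 / 3, 1 / 3, 1 / 3]

# Information available from 5 online pulls with no offline data.
no_data = residual_information(prior, thetas, horizon=5)

# Same quantity after conditioning on 16 successes and 4 failures offline.
belief = posterior(prior, thetas, successes=16, failures=4)
with_data = residual_information(belief, thetas, horizon=5)

print(f"without offline data: {no_data:.4f} nats")
print(f"with offline data:    {with_data:.4f} nats")
```

In this toy setting the offline counts concentrate the belief on one candidate, so the online pulls can reveal less about $\chi$ than they could under the prior, yet the remaining information is not zero. That residual is what the abstract describes as the uncertainty still left to explore.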