Off-Policy Learning in Large Action Spaces: Optimization Matters More Than Estimation

arXiv:2509.03456v2 Announce Type: replace-cross Off-policy evaluation (OPE) and off-policy learning (OPL) are foundational for decision-making in offline contextual bandits. Recent advances in OPL primarily optimize OPE estimators with improved statistical properties, assuming that better estimators inherently yield superior policies. Although theoretically justified, this estimator-centric approach neglects a critical practical obstacle: challenging optimization landscapes.