Global linear convergence of entropy-regularized softmax policy gradient beyond tabular MDPs

arXiv:2605.24939v1 Announce Type: new

We study the global convergence of policy gradient for infinite-horizon entropy-regularized Markov decision processes (MDPs) with continuous state and action spaces. We consider log-linear softmax policies with linear function approximation, which extend the tabular softmax parameterization while retaining a tractable policy class. Under $Q^\pi_\tau$-realizability for the regularized state-action value function, we first establish a non-uniform Polyak–Łojasiewicz (PŁ) inequality.
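For readers unfamiliar with the parameterization, here is a minimal sketch of a log-linear softmax policy, $\pi_\theta(a \mid s) \propto \exp(\theta^\top \phi(s,a))$. It computes its score function $\nabla_\theta \log \pi_\theta(a \mid s) = \phi(s,a) - \mathbb{E}_{a' \sim \pi_\theta(\cdot \mid s)}[\phi(s,a')]$ and its entropy, the quantity the regularizer acts on. The feature map `phi` is hypothetical, and the continuous action space is approximated by a finite action grid purely for illustration. Neither is taken from the paper. Choosing one-hot features $\phi(s,a) = e_{(s,a)}$ recovers the tabular softmax case.

```python
import numpy as np


def phi(s, a):
    """Hypothetical feature map phi(s, a) in R^3 (an illustrative assumption)."""
    return np.array([1.0, s * a, a ** 2])


def log_linear_policy(theta, s, actions):
    """pi_theta(a | s) proportional to exp(theta^T phi(s, a)), evaluated on a
    finite action grid that stands in for the continuous action space."""
    logits = np.array([theta @ phi(s, a) for a in actions])
    logits -= logits.max()  # softmax is shift-invariant; this avoids overflow
    probs = np.exp(logits)
    return probs / probs.sum()


def score(theta, s, a, actions):
    """Score function: phi(s, a) minus the policy-averaged feature vector."""
    probs = log_linear_policy(theta, s, actions)
    feats = np.array([phi(s, b) for b in actions])
    return phi(s, a) - probs @ feats


def entropy(theta, s, actions):
    """Shannon entropy of pi_theta(. | s) on the action grid."""
    probs = log_linear_policy(theta, s, actions)
    return -np.sum(probs * np.log(probs))


actions = np.linspace(-1.0, 1.0, 5)
theta = np.array([0.3, -1.2, 0.7])
print(log_linear_policy(theta, 0.5, actions))
print(entropy(theta, 0.5, actions))
```

At $\theta = 0$ the policy is uniform on the grid, so its entropy is maximal ($\log 5$ here). For any $\theta$, the score has zero mean under $\pi_\theta$. This identity underlies the usual policy-gradient expression.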