Diffusion-Augmented Markov Decision Processes for Maximum Entropy Reinforcement Learning

arXiv:2512.02019v3 Announce Type: replace-cross
Diffusion models excel at sampling from complex, unnormalized distributions. In this work, we extend Maximum Entropy Reinforcement Learning (ME-RL) to diffusion processes, enabling sampling from the trajectory distribution of the optimal policy. By minimizing a tractable upper bound on the reverse KL divergence between the trajectory distributions of the diffusion policy and the optimal policy, we derive a modified surrogate objective and