Stochastic MeanFlow Policies: One-Step Generative Control with Entropic Mirror Descent

arXiv:2605.21282v1

Online off-policy reinforcement learning (RL) is shaped by two coupled choices: the policy class and the update rule. Gaussian policies are fast to sample and have closed-form entropy, but they struggle to represent multimodal action distributions. Generative policies are expressive, but they often require iterative sampling or lack tractable entropy estimates.
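The trade-off attributed to Gaussian policies above can be made concrete with a small illustration. This is not the paper's method. It is a minimal sketch, assuming a diagonal Gaussian policy and a hypothetical 1-D toy action set with two modes at -1 and +1. The helper names (diag_gaussian_entropy, mc_entropy, fit_gaussian_1d) are illustrative only. The sketch shows two things. First, the entropy of a diagonal Gaussian has the closed form H = 0.5 * sum_i log(2*pi*e*sigma_i^2), which agrees with a Monte Carlo estimate. Second, a single Gaussian fitted by maximum likelihood to bimodal actions puts its mean between the modes, where the data has almost no mass.

```python
import math
import random


def diag_gaussian_entropy(sigmas):
    """Closed-form differential entropy (nats) of a diagonal Gaussian."""
    return 0.5 * sum(math.log(2 * math.pi * math.e * s * s) for s in sigmas)


def diag_gaussian_log_prob(x, mus, sigmas):
    """Log-density of x under a diagonal Gaussian with means mus, stds sigmas."""
    return sum(
        -0.5 * ((xi - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
        for xi, m, s in zip(x, mus, sigmas)
    )


def mc_entropy(mus, sigmas, n=20000, seed=0):
    """Monte Carlo entropy estimate: average of -log p(x) over samples x ~ p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = [rng.gauss(m, s) for m, s in zip(mus, sigmas)]
        total -= diag_gaussian_log_prob(x, mus, sigmas)
    return total / n


def fit_gaussian_1d(samples):
    """Maximum-likelihood fit of a single 1-D Gaussian (sample mean and std)."""
    mu = sum(samples) / len(samples)
    var = sum((a - mu) ** 2 for a in samples) / len(samples)
    return mu, math.sqrt(var)


# Tractable entropy: the closed form matches the sampled estimate.
mus, sigmas = [0.0, 0.0], [0.5, 2.0]
h_closed = diag_gaussian_entropy(sigmas)
h_mc = mc_entropy(mus, sigmas)

# Hypothetical bimodal "good actions": two narrow modes at -1 and +1.
rng = random.Random(1)
actions = [rng.choice([-1.0, 1.0]) + rng.gauss(0.0, 0.1) for _ in range(10000)]
mu_fit, sigma_fit = fit_gaussian_1d(actions)

# Fraction of data near the fitted Gaussian's mode (|a| < 0.5): close to zero,
# so the single Gaussian concentrates its density where no good actions lie.
near_mode = sum(abs(a - mu_fit) < 0.5 for a in actions) / len(actions)

print(f"closed-form H = {h_closed:.3f}, Monte Carlo H = {h_mc:.3f}")
print(f"fitted mu = {mu_fit:.3f}, sigma = {sigma_fit:.3f}, mass near mu = {near_mode:.4f}")
```

An expressive generative policy could place mass on both modes. The abstract notes the price: sampling may become iterative, and the log-density needed for an entropy estimate like mc_entropy may no longer be available in closed form.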