GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models

arXiv:2605.29398v1 Announce Type: cross Reinforcement learning (RL) can improve the policy (denoiser) of diffusion large language models (dLLMs), but it is hindered by the intractability of the policy likelihood. A dominant and efficient family of methods replaces the likelihood in standard RL with its evidence lower bound (ELBO), estimated from randomly masked sequences. Despite being well aligned with pre-