Self-Distilled Policy Gradient
arXiv CS.LG
arXiv:2606.04036v1. Announce type: new.

On-policy self-distillation, in which a language model conditions on privileged context to supervise its own generations, is a promising source of dense supervision for sparse-reward reinforcement learning. It can be instantiated as an auxiliary loss: a full-vocabulary reverse Kullback-Leibler divergence from the student to the teacher. We therefore ...
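
As a minimal sketch of the auxiliary loss described above: assume the student is the model without privileged context and the teacher is the same model conditioned on that context. Under the usual distillation convention, the reverse KL term at each token position is then KL(student || teacher), summed over the full vocabulary. The function names and the mean-over-tokens reduction below are illustrative assumptions, not taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position, over the full vocabulary.

    The expectation is taken under the student distribution, which is what
    makes this the "reverse" direction in the distillation convention.
    """
    log_p = log_softmax(student_logits)  # student log-probabilities
    log_q = log_softmax(teacher_logits)  # teacher log-probabilities
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def self_distillation_loss(student_logits_seq, teacher_logits_seq):
    """Mean per-token reverse KL over a generated sequence.

    Hypothetical reduction: averaging over tokens is an assumption of this sketch.
    """
    if not student_logits_seq:
        raise ValueError("expected at least one token position")
    per_token = [
        reverse_kl(s, t) for s, t in zip(student_logits_seq, teacher_logits_seq)
    ]
    return sum(per_token) / len(per_token)


# Toy example with a 3-token vocabulary and 2 generated positions.
student = [[2.0, 0.5, 0.1], [0.3, 0.3, 1.5]]
teacher = [[1.0, 1.0, 0.2], [0.1, 0.2, 2.0]]
loss = self_distillation_loss(student, teacher)
```

In an actual training setup the teacher logits would typically be treated as constants (no gradient), so that only the student distribution is updated by this term; that detail is likewise an assumption here rather than something the abstract states.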