Adversarial Dual On-Policy Distillation from Expressive Flow-based Teacher

arXiv:2605.27095v1 Announce Type: new Learning from demonstrations in embodied control is often cast as behavioral cloning, and recent diffusion or flow-matching policies improve this paradigm by modeling multi-modal expert actions. Yet these methods remain offline supervised learners: the policy is trained only on expert states and receives no corrective signal on the states it actually visits. On-policy distillation (OPD) offers a natural remedy, but standard OPD assumes a strong fixed teacher, which is unavailable in demonstration-only control.