Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards

arXiv:2606.02194v1 Announce Type: new Distilling expert demonstration data into large generative models using behavioral cloning is a scalable approach to learning capable policies for robotic control, particularly for dexterous manipulation. Reinforcement learning (RL) can be used to finetune these policies further using additional experience. An open question is whether RL is more sample-efficient than collecting human demonstrations. Prior work has finetuned large pretrained policies in a scalable fashion by applying RL to a smaller residual policy that corrects the pretrained model.
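
The residual-policy idea mentioned above can be illustrated with a minimal sketch. This is not the paper's implementation: the additive composition, the zero-initialized linear residual, the `scale` factor, and all names (`base_policy`, `ResidualPolicy`, `act`) are assumptions chosen for illustration. The pretrained model is treated as frozen, and only the small residual would be updated by RL.

```python
def base_policy(obs):
    """Hypothetical stand-in for a frozen, pretrained behavior model."""
    return [0.5 * x for x in obs]


class ResidualPolicy:
    """Small trainable correction (assumed linear here for brevity).

    Weights start at zero, so the combined policy initially reproduces
    the base policy exactly.
    """

    def __init__(self, obs_dim, act_dim, scale=0.1):
        self.weights = [[0.0] * obs_dim for _ in range(act_dim)]
        self.scale = scale  # assumed bound on how far the correction can move the action

    def __call__(self, obs):
        return [
            self.scale * sum(w * x for w, x in zip(row, obs))
            for row in self.weights
        ]


def act(obs, residual):
    """Combined action: frozen base action plus the learned correction."""
    base_action = base_policy(obs)
    correction = residual(obs)
    return [a + c for a, c in zip(base_action, correction)]
```

In this sketch an RL algorithm would adjust only `residual.weights`, which is far fewer parameters than a large generative policy would have.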