Bilevel Optimization of Synthetic Trajectories for Multi-Turn LLM Fine-Tuning

arXiv:2605.24743v1 Announce Type: cross

While LLMs excel at single-turn generation, they struggle with long-horizon, multi-turn interactions. Offline reinforcement learning (RL) offers a scalable approach, yet its performance hinges on the availability and quality of multi-turn trajectory data. A common remedy is to augment