Plan, Watch, Recover: A Benchmark and Architectures for Proactive Procedural Assistance

arXiv:2606.04970v1 Announce Type: cross We envision a proactive multi-modal assistant system that gives users real-time, step-by-step guidance on a procedural task, autonomously deciding \textit{when} to interrupt and \textit{how} to coach. However, progress is limited by the absence of large-scale, cross-domain benchmarks that reflect realistic conditions, particularly the common case in which users deviate from the expected step sequence.