InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward

arXiv:2605.26520v1 Announce Type: cross While vision-language models (VLMs) have exhibited multi-turn visual reasoning capabilities, their reasoning trajectories remain relatively shallow and are dominated by a text-centric paradigm, limiting their applicability to complex visual challenges. In contrast, human-like thought typically involves long-horizon reasoning with an interleaved visual-textual chain-of-thought (VT-CoT). To bridge this gap, we