Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?

arXiv:2605.21642v1 Announce Type: new Vision-language models (VLMs) are increasingly augmented with continuous or latent non-textual tokens intended to enable "visual thinking." Improved task accuracy alone, however, does not show that models actually use these tokens for reasoning -- gains may arise from confounds such as added context length, special-token anchoring, or