Beyond Visual Memory: Mechanistic Diagnostics of Latent Visual Reasoning

arXiv:2606.01287v1 Announce Type: cross Recent latent visual reasoning methods achieve substantial gains by inserting continuous latent tokens into multimodal language models. These gains are commonly attributed to the tokens encoding visual evidence; recent analyses, however, reveal a paradox: the tokens are loosely tied to the image and contribute little to the answer. Critically, these analyses treat latent tokens as a single unit, obscuring the true source of the gains. We therefore ...