Vision Transformers Need Better Token Interaction

arXiv:2605.23868v1 Announce Type: new

Vision Transformers (ViTs) can. We revisit this dense degradation phenomenon and argue that it is not fully explained by high-norm artifacts alone. Instead, we characterize \emph{semantic diffusion}: an optimization shortcut in which global semantic information spreads through patch tokens beyond what is locally justified.