Visual-Redundancy-Controlled Parallel Decoding for Diffusion-Based Multimodal Large Language Models

arXiv:2605.25820v1 Announce Type: new Diffusion-based multimodal large language models (dMLLMs) decode by iteratively predicting tokens at multiple masked positions in parallel. This turns each decoding step into a position-selection problem: the model must choose not only which predictions are reliable in isolation, but also which positions should be committed together as context for later decoding steps.
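
To make the position-selection framing concrete, the following is a minimal sketch of a generic masked parallel decoding loop. It assumes a plain top-k confidence rule for choosing which positions to commit at each step; this is a common baseline used here for illustration, not the selection rule proposed in this work. The names `toy_predict`, `select_positions`, and `parallel_decode`, and the toy confidence function, are all hypothetical stand-ins for a real model.

```python
from typing import Dict, List, Optional, Tuple

MASK = None  # sentinel for a still-masked position
TARGET = ["a", "b", "c", "d", "e"]  # toy "ground truth" the fake model predicts

Prediction = Tuple[str, float]  # (token, confidence)


def toy_predict(seq: List[Optional[str]]) -> Dict[int, Prediction]:
    """Hypothetical stand-in for a dMLLM forward pass.

    Returns a prediction for every masked position. Confidence decays with
    distance to the nearest committed token, mimicking the idea that
    predictions near existing context are more reliable.
    """
    known = [i for i, tok in enumerate(seq) if tok is not MASK]
    preds: Dict[int, Prediction] = {}
    for pos, tok in enumerate(seq):
        if tok is MASK:
            dist = min((abs(pos - k) for k in known), default=len(seq))
            preds[pos] = (TARGET[pos], 1.0 / (1.0 + dist))
    return preds


def select_positions(preds: Dict[int, Prediction], k: int) -> List[int]:
    """Assumed selection rule: commit the k most confident masked positions."""
    ranked = sorted(preds, key=lambda pos: -preds[pos][1])
    return sorted(ranked[:k])


def parallel_decode(
    seq: List[Optional[str]], k: int
) -> Tuple[List[Optional[str]], int]:
    """Iteratively predict all masked positions, committing k per step."""
    seq = list(seq)  # do not mutate the caller's list
    steps = 0
    while any(tok is MASK for tok in seq):
        preds = toy_predict(seq)
        for pos in select_positions(preds, k):
            seq[pos] = preds[pos][0]  # committed tokens become context
        steps += 1
    return seq, steps


# Usage: one committed token, four masked positions, two commits per step.
decoded, steps = parallel_decode(["a", MASK, MASK, MASK, MASK], k=2)
print(decoded, steps)
```

The top-k rule scores each position in isolation. The second half of the selection problem described above, deciding which positions should be committed together, would replace `select_positions` with a rule that also considers interactions among the candidate positions.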