DenseMLLM: Standard Multimodal LLMs for Dense Prediction

arXiv:2602.14134v2 Announce Type: replace-cross Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in high-level visual understanding. However, extending these models to fine-grained dense prediction tasks, such as semantic segmentation and depth estimation, typically necessitates the incorporation of complex, task-specific decoders and other customizations. This architectural fragmentation increases model complexity and deviates from the generalist design of MLLMs, ultimately limiting their practicality.