MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models

arXiv:2605.26004v1 Announce Type: cross Instruction tuning of large vision-language models (LVLMs) increasingly depends on massive multimodal corpora, yet these datasets contain samples with substantial redundancy, low visual dependency, and highly imbalanced coverage of multimodal reasoning behaviors. As a result, uniform subsampling or naive score-based selection often yields suboptimal