DRScaffold: Boosting Dense-Scene Reasoning in Lightweight Vision Language Models

arXiv:2605.26038v1 Announce Type: cross Lightweight vision-language models perform competitively on standard benchmarks yet fail systematically in dense-scene reasoning, where multiple objects, attributes, and relations must be jointly grounded and resolved through multi-step inference. Such capability is critical for real-world applications where models must reliably interpret cluttered environments. Yet existing