Look-Closer-Then-Diagnose: Confidence-Aware Ultrasound VQA via Active Zooming

arXiv:2605.21652v1

Vision-Language Models (VLMs) have significantly advanced medical visual question answering, yet their performance on ultrasound images remains suboptimal. In clinical practice, sonographers explicitly focus on lesion regions when formulating reports, though diagnostic interpretations sometimes vary because of inherent subjectivity.