Generating Reports or Repeating Templates? Measuring and Mitigating Template Collapse in 3D CT Report Generation

arXiv:2605.30984v1 Announce Type: cross Modern 3D medical vision-language models (VLMs) can generate fluent radiology-style text while exhibiting critically low pathology detection and output diversity, collapsing to generic templates that under-report rare yet critical findings. We identify this failure mode as Template Collapse. It stems from constraints characteristic of 3D medical imaging, such as limited data, severe label imbalance, and weak signals from volumetric encoders.
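
As a rough illustration of what "low output diversity" can look like in numbers, the following minimal Python sketch computes three simple statistics over a batch of generated reports. The function name `diversity_stats` and the chosen statistics (unique-report ratio, share of the most frequent report, and a distinct-n ratio) are assumptions made for illustration. They are not the metrics defined in the paper.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def diversity_stats(reports, n=2):
    """Hypothetical diversity summary for a batch of generated reports.

    Illustrative only: these statistics are assumptions, not the
    measures used by the paper.
    """
    if not reports:
        raise ValueError("reports must be non-empty")

    # Lowercase and collapse whitespace so trivially different strings match.
    normalized = [" ".join(r.lower().split()) for r in reports]
    counts = Counter(normalized)
    total = len(normalized)

    # Fraction of reports that are distinct after normalization.
    unique_ratio = len(counts) / total
    # Share of the batch taken by the single most frequent report.
    top_template_share = counts.most_common(1)[0][1] / total

    # Distinct-n: unique n-grams divided by all n-grams across the batch.
    all_ngrams = [g for r in normalized for g in ngrams(r.split(), n)]
    distinct_n = len(set(all_ngrams)) / len(all_ngrams) if all_ngrams else 0.0

    return {
        "unique_ratio": unique_ratio,
        "top_template_share": top_template_share,
        "distinct_n": distinct_n,
    }


if __name__ == "__main__":
    # Toy reports, invented for this example.
    batch = [
        "No acute findings.",
        "no acute  findings.",
        "Small left pleural effusion.",
        "No acute findings.",
    ]
    print(diversity_stats(batch))
```

Under these assumptions, a model that has collapsed to a template would show a low unique-report ratio and a high share for its most frequent report, even when each individual report reads fluently. Surface statistics of this kind say nothing about pathology detection, which would need to be measured separately against reference labels.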