FAST-GOAL: Fast and Efficient Global-local Object Alignment Learning

arXiv:2605.26615v1 Announce Type: new Vision-language models such as CLIP have shown impressive capabilities in aligning images and text, but they often struggle with lengthy and detailed text descriptions due to pre-