Unsupervised Semantic Segmentation Facilitates Model Understanding

ArXi:2605.29691v1 Announce Type: new Self-supervised learning (SSL) has produced a diverse landscape of vision transformers (ViTs) whose pretrained representations a wide range of downstream tasks. Towards a better understanding of these models, a body of work has assessed the mechanics of their self-attention as well as the types of information captured across their representations, revealing, for example, stark differences between models trained with contrastive learning (CL) and masked image modeling.