HOLA: Holistic Multi-Modal Alignment for Open-Set 3D Recognition

arXiv:2606.01334v1

Open-set 3D recognition requires models that generalize to rare or unseen categories. Recent approaches address this by distilling vision-language knowledge into 3D encoders. They typically rely on heavy 2D ViTs and align each point cloud with a single image or caption, which anchors the learned representations to partial views of the object. We propose aligning each point cloud with multiple images and textual descriptions to capture a holistic understanding of 3D objects.
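As a rough illustration of what aligning one point cloud with several images and captions could look like, the sketch below computes a multi-positive, InfoNCE-style contrastive loss in NumPy. The abstract does not specify the training objective, so this loss is an assumption, and the function name `multi_positive_alignment_loss`, the `temperature` value, and the array shapes are all hypothetical. It assumes the point-cloud, image, and text embeddings already share one embedding dimension.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def multi_positive_alignment_loss(pc, targets, temperature=0.07):
    """Hypothetical multi-positive contrastive loss (not the paper's stated objective).

    pc:      (B, D) point-cloud embeddings, one per object.
    targets: (B, K, D) K image and/or text embeddings per object.
    All K targets of object i are positives for pc[i]; the targets of
    every other object in the batch act as negatives.
    """
    B, K, D = targets.shape
    pc = l2_normalize(pc)
    flat = l2_normalize(targets).reshape(B * K, D)

    # Cosine similarity of each point cloud to every target in the batch.
    logits = pc @ flat.T / temperature  # (B, B*K)

    # Numerically stable log-softmax over all targets.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Row-major reshape means target j belongs to object j // K.
    owner = np.arange(B * K) // K
    pos_mask = owner[None, :] == np.arange(B)[:, None]  # (B, B*K)

    # Average the log-probability over the K positives of each object.
    loss = -(log_prob * pos_mask).sum(axis=1) / K
    return float(loss.mean())


# Usage with random stand-in embeddings: 4 images and 2 captions per object.
rng = np.random.default_rng(0)
pc_emb = rng.normal(size=(8, 32))
image_emb = rng.normal(size=(8, 4, 32))
text_emb = rng.normal(size=(8, 2, 32))
loss = multi_positive_alignment_loss(
    pc_emb, np.concatenate([image_emb, text_emb], axis=1)
)
```

Pooling images and captions into one set of positives is only one possible design; a separate loss term per modality, each with its own weight, would be an equally plausible reading of the abstract.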