Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds

arXiv:2510.27391v2

Modality alignment is critical for vision-language models (VLMs) to effectively integrate information across modalities. However, existing methods extract hierarchical features from text while representing each image with a single feature, leading to asymmetric and suboptimal alignment. To address this, we propose Alignment across Trees, a method that constructs and aligns tree-like hierarchical features for both image and text modalities. Specifically, we