AI RESEARCH
CLIP-like Model as a Foundational Density Ratio Estimator
arXiv CS.CV
•
ArXi:2506.22881v3 Announce Type: replace Density ratio estimation is a core concept in statistical machine learning because it provides a unified mechanism for tasks such as importance weighting, divergence estimation, and likelihood-free inference, but its potential in vision and language models has not been fully explored. Modern vision-language encoders such as CLIP and SigLIP are trained with contrastive objectives that implicitly optimize log density ratios between joint and marginal image-text distributions, which implicitly learn similarity scores proportional to log density ratios.