AI RESEARCH

Do Vision-Language Models Understand 3D Scenes or Just Catalogue Objects?

arXiv CS.LG

arXiv:2605.20448v1 Announce Type: cross Vision-language models reliably name objects in a scene, but do they represent the 3D layout those objects inhabit? We