AI RESEARCH
Do Vision-Language Models Understand 3D Scenes or Just Catalogue Objects?
arXiv cs.LG
arXiv:2605.20448v1 Announce Type: cross Vision-language models reliably name objects in a scene, but do they represent the 3D layout those objects inhabit? We