Compositional Semantics for Open Vocabulary Spatio-semantic Representations

arXiv:2310.04981v2. Vision-language models (VLMs) transform environment percepts into vision-language semantics interpretable by large language models (LLMs). However, completing complex tasks often requires reasoning about information beyond what is currently perceived. We propose latent compositional semantic embeddings z* as a principled learning-based knowledge representation for queryable spatio-semantic memories. We mathematically prove that z* can always be found, and that the optimal z* is the centroid of any set Z.
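
The centroid claim can be illustrated with a minimal sketch. The abstract does not state the objective that z* optimizes, so the code assumes one for illustration: the embeddings in Z are unit-normalized and z* is the unit vector maximizing the mean cosine similarity to all members of Z. Under that assumption the maximizer is the normalized centroid, since the sum of dot products equals the dot product with the sum. The helper names (`normalize`, `centroid`, `mean_cosine`) and the random toy data are hypothetical and not taken from the paper.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def mean_cosine(z, vectors):
    """Mean cosine similarity between z and every vector in the set."""
    z = normalize(z)
    total = 0.0
    for v in vectors:
        total += sum(a * b for a, b in zip(z, normalize(v)))
    return total / len(vectors)


random.seed(0)
dim = 8

# Toy stand-in for a set Z of semantic embeddings (assumed unit-normalized).
Z = [normalize([random.gauss(0, 1) for _ in range(dim)]) for _ in range(5)]

# Candidate z*: the centroid of Z, projected back onto the unit sphere.
z_star = normalize(centroid(Z))
best = mean_cosine(z_star, Z)

# Compare against random directions; none should score higher than z*.
candidates = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(1000)]
beaten = sum(mean_cosine(c, Z) > best + 1e-12 for c in candidates)
print(beaten)  # 0 under the assumed objective
```

The random comparison is only a sanity check on the assumed objective, not a substitute for the proof the abstract refers to.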