Probing Collision Grounding in Vision-Language Models for Safe Human-Robot Collaboration

arXiv:2605.31196v1 Announce Type: cross Safe human-robot collaboration requires more than visual description: a monitor must determine whether the robot body is safely separated, already colliding with the scene or a person, or about to collide. We call this capability collision grounding: binding visual observations to robot body geometry, camera viewpoint, scene layout, human proximity, and temporal motion in order to infer present and imminent contact. We