Recognizing Co-Speech Gestures in-the-Wild

arXiv:2605.31589v1 Announce Type: new While humans naturally gesture during speech, only a sparse subset of these movements is visually depictive and semantically linked to specific spoken words. Current multimodal models struggle to capture these semantic co-speech gestures, heavily bottlenecked by a lack of precisely annotated