NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models

arXiv:2606.04773v1 Announce Type: cross Reliable evaluation of human motion understanding is fundamental to advancing embodied AI, robotics, and animation. However, existing benchmarks suffer from coarse semantic granularity, undifferentiated difficulty, limited annotation quality, and pervasive answer ambiguity, leaving them unable to diagnose where current models fail. To bridge this gap, we