OmniGF: A Dual-Branch Vision-Language Framework for Unified Gaze Following

arXiv:2605.26399v1

Understanding human gaze behavior is essential for complex scene comprehension and human-computer interaction. Traditional gaze following models are typically restricted to pure spatial localization, lacking the high-level capacity to reason about semantic targets or complex social contexts. Furthermore, these models often process individuals sequentially, requiring redundant computations over the same scene image for multi-person inference.