Grounding or Guessing? Visual Signals for Detecting Hallucinations in Sign Language Translation

arXiv:2510.18439v3

Hallucination, where models generate fluent text unsupported by visual evidence, remains a major flaw in vision-language models and is particularly critical in sign language translation (SLT). In SLT, meaning depends on precise grounding in video, and gloss-free models are especially vulnerable because they map continuous signer movements directly into natural language, without the intermediate gloss supervision that would otherwise serve as alignment. We argue that hallucinations arise when models rely on language priors rather than visual input.