TANDEM: Temporal-Aware Neural Detection for Multimodal Hate Speech

arXiv:2601.11178v2 Announce Type: replace

Social media platforms are increasingly dominated by long-form multimodal content, where harmful narratives are constructed through a complex interplay of audio, visual, and textual cues. While automated systems can flag hate speech with high accuracy, they often function as "black boxes" that fail to provide the granular, interpretable evidence, such as precise timestamps and target identities, required for effective human-in-the-loop moderation. In this work, we.