Spatio-Temporal Similarity Volume Aggregation for Open-Vocabulary Action Recognition

arXiv:2605.23288v1 Announce Type: new

Recent Open-Vocabulary Action Recognition (OVAR) methods typically aggregate visual features into a global representation before computing text alignment, a process that obscures local patch information and fine-grained spatio-temporal cues. We propose Similarity Volume Aggregation (SimVA), a framework that constructs a dense 4D spatio-temporal similarity volume from patch-level visual-text similarities.
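To make the contrast concrete, the following is a minimal sketch, not the authors' implementation. It assumes the four axes of the volume are time, height, width, and text class, and that similarity is cosine similarity between each patch feature and each text embedding; the abstract states neither. The function names `similarity_volume` and `global_pool_similarity` and the toy features are hypothetical, chosen only for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def similarity_volume(patch_feats, text_embs):
    """Assumed layout: patch_feats is [T][H][W][D], text_embs is [C][D].

    Returns a nested list [T][H][W][C] holding one similarity score per
    (frame, row, column, class) cell.
    """
    return [
        [[[cosine(patch, text) for text in text_embs] for patch in row]
         for row in frame]
        for frame in patch_feats
    ]


def global_pool_similarity(patch_feats, text_embs):
    """Baseline for comparison: mean-pool all patches, then align with text."""
    patches = [p for frame in patch_feats for row in frame for p in row]
    dim = len(patches[0])
    mean = [sum(p[d] for p in patches) / len(patches) for d in range(dim)]
    return [cosine(mean, text) for text in text_embs]


# Toy input (hypothetical): 2 frames, 2x2 patches, 2-dim features, 2 classes.
# Exactly one patch (frame 0, row 1, col 1) matches class 1.
patch_feats = [
    [[[1.0, 0.0], [1.0, 0.0]],
     [[1.0, 0.0], [0.0, 1.0]]],
    [[[1.0, 0.0], [1.0, 0.0]],
     [[1.0, 0.0], [1.0, 0.0]]],
]
text_embs = [[1.0, 0.0], [0.0, 1.0]]

volume = similarity_volume(patch_feats, text_embs)
pooled = global_pool_similarity(patch_feats, text_embs)

print(volume[0][1][1])  # per-class scores of the single class-1 patch
print(pooled)           # per-class scores after global pooling
```

In this toy input the one patch that matches the second class keeps a full-strength score in its cell of the volume, whereas the pooled representation averages it against seven non-matching patches. This is the kind of local cue the abstract says global aggregation obscures.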