O-MARC: Omni Memory-Augmented Compression Distillation for Efficient Video Understanding

arXiv:2605.26584v1 Announce Type: new Omnimodal large language models enable unified audio-video understanding, but long joint token sequences make inference costly, and existing benchmarks do not fully isolate audio-visual association in noisy user-generated videos. We