Before Fusion, Ask What to Keep: Contextual Calibration of Multimodal Signals

arXiv:2606.02679v1 Announce Type: new Multimodal systems often benefit from combining information across language, sound, and visual streams, but this benefit is not guaranteed. A modality that is useful for one input may become distracting for another, and local feature responses within the same modality can disagree with evidence from other sources. This work investigates how to adjust multimodal representations before they are merged by a downstream predictor.