AI RESEARCH

Multi-Modal Building Inspection via Perceiver IO Fusion of Satellite and Street-Level Imagery

arXiv cs.CV

arXiv:2605.26381v1 Announce Type: new

We present a multi-modal classification framework that fuses satellite and street-level imagery through a Perceiver IO architecture operating on spatial patch tokens from a shared DINOv2 backbone. The design handles a variable number of street-level views per building without padding or fixed-size pooling, and jointly predicts multi-label roof element and roof material classes.
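The fusion idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the class name `TinyPerceiverIO`, the token dimension, the number of latents, the class counts, and the random arrays standing in for DINOv2 patch tokens are all hypothetical, and the sketch omits the latent self-attention blocks, layer normalization, multi-head attention, and modality embeddings that a full Perceiver IO would use. What it shows is why no padding is needed: a fixed-size latent array cross-attends over however many tokens are concatenated, so the output shape does not depend on the number of street-level views.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, tokens, w_q, w_k, w_v):
    """Single-head cross-attention: each query row attends over all token rows."""
    q, k, v = queries @ w_q, tokens @ w_k, tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_queries, n_tokens)
    return weights @ v                                  # (n_queries, dim)

class TinyPerceiverIO:
    """Hypothetical, untrained sketch; all sizes are illustrative assumptions."""

    def __init__(self, dim=32, n_latents=8, n_elements=5, n_materials=4, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *shape: rng.normal(0, dim ** -0.5, size=shape)
        self.latents = init(n_latents, dim)            # fixed-size latent array
        self.enc = [init(dim, dim) for _ in range(3)]  # encoder q/k/v projections
        self.dec = [init(dim, dim) for _ in range(3)]  # decoder q/k/v projections
        self.out_queries = init(2, dim)                # one output query per task
        self.head_elements = init(dim, n_elements)     # roof element head
        self.head_materials = init(dim, n_materials)   # roof material head

    def forward(self, sat_tokens, street_views):
        # Concatenate all patch tokens; the total count varies with the views.
        tokens = np.concatenate([sat_tokens, *street_views], axis=0)
        # Encode: latents read from the variable-length token set.
        latents = cross_attend(self.latents, tokens, *self.enc)
        # Decode: task queries read from the fixed-size latents.
        out = cross_attend(self.out_queries, latents, *self.dec)
        return out[0] @ self.head_elements, out[1] @ self.head_materials

# Usage: random arrays stand in for backbone patch tokens (16 patches, dim 32).
rng = np.random.default_rng(1)
model = TinyPerceiverIO()
sat = rng.normal(size=(16, 32))
for n_views in (1, 3):
    views = [rng.normal(size=(16, 32)) for _ in range(n_views)]
    elem_logits, mat_logits = model.forward(sat, views)
    print(n_views, elem_logits.shape, mat_logits.shape)
```

Both calls return logits of the same shape even though the token count differs. How the logits are turned into predictions (for example, a per-class sigmoid for the multi-label outputs) is left out here, since the abstract does not specify it.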