DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark

ArXi:2605.30027v1 Announce Type: new Multimodal documents contain diverse elements, such as tables, figures, and layouts, which can complicate retrieval tasks. While current approaches typically combine dense visual embedding models with supervised rerankers to achieve high-precision retrieval, they face inherent limitations. First, the coarse-grained nature of dense embeddings tends to obfuscate explicit semantics, failing to leverage structurally salient information.