Inference-Free Multimodal Learned Sparse Retrieval for Production-Scale Visual Document Search

arXiv:2605.30917v1 Announce Type: cross

As large-scale visual-document corpora such as arXiv papers and enterprise PDFs continue to grow, visual-document retrieval has gained increasing attention. It still lacks, however, a deployable system that lexically indexes visual documents and serves queries at scale without neural encoding. Existing methods fall into two groups: VLM-based dense or multi-vector models achieve strong retrieval quality but require neural query encoding at serving time, while OCR- or caption-based BM25 avoids query encoding at the cost of time-consuming text extraction or generation.