Doc-CoB: Enhancing Document Understanding with Visual Chain-of-Boxes Reasoning

arXiv:2505.18603v2 Announce Type: replace

Document understanding aims to perform question answering and information extraction over document images, where the visual content is highly information-dense and most queries rely on only a few relevant layout regions. However, existing methods either adopt a one-pass strategy that implicitly assumes all layouts are equally important, or focus excessively on small regions at the cost of losing critical layout information. To address these limitations, we