Vision-capable LLMs vs. OCR for long-document (including charts, images, tables, etc.) QA
r/artificial
Generative AI
Computer Vision
I benchmarked vision-capable LLMs (the "just attach the PDF and let the model read it" pattern) against OCR-based pipelines on 30 long, image-heavy PDFs from MMLongBench-Doc. There were 171 questions in total, using Claude Sonnet 4.5 as the