"Înțelegi Românește?" A Recipe for Romanian Vision-Language Models

arXiv:2605.31401v1

Vision-Language Models (VLMs) largely follow the text-only LLM trajectory: they excel on English benchmarks but degrade sharply on low-resource languages, for which neither large-scale image-text corpora nor culturally grounded evaluations exist. We present a systematic study of building a language-specific VLM for Romanian, covering the full pipeline from data construction to architectural choices. We translate established English