BhashaSetu: A Data-Centric Approach to Low-Resource Machine Translation

arXiv:2605.27050v1 Announce Type: cross We present BhashaSetu, a linguistically enriched English--Marathi parallel dataset addressing persistent data limitations in low-resource neural machine translation (NMT). Marathi, spoken by over 95M people, remains underrepresented in high-quality parallel corpora across diverse domains. Our dataset comprises 2.78M sentence pairs from heterogeneous sources including news, politics, healthcare, literature, and culture, with stemmed and lemmatized representations to support morphology-aware analysis.