AI RESEARCH

FalAR: A Large-scale Speaker-Annotated European Portuguese Speech Corpus of Parliamentary Sessions

arXiv cs.LG

arXiv:2605.27062v1 Announce Type: cross State-of-the-art performance in Automatic Speech Recognition (ASR) depends largely on the availability of large-scale labeled corpora. This creates a demand for more data collection, particularly for under-represented languages and dialectal varieties. With considerably fewer speakers (around 11M) than Brazilian Portuguese (BP, around 200M), European Portuguese (EP) is overshadowed in currently available large-scale speech resources, which leads to under-performing speech-based systems for EP users.