Bilingual ASR model with language identification for Brazilian Portuguese and South-American Spanish
Felipe Farias, William Alberto Cruz Castañeda, Wilmer Lobato, Marcellus Amadeus

DOI: 10.14209/sbrt.2023.1570916069
Event: XLI Simpósio Brasileiro de Telecomunicações e Processamento de Sinais (SBrT2023)
Keywords: ASR, language identification
Building accurate and reliable automatic speech recognition (ASR) models in low-resource settings remains challenging because curated data is scarce. This work presents a bilingual ASR model for Brazilian Portuguese and South-American Spanish. The model is based on the Wav2Vec 2.0 architecture and is trained on multiple speech datasets. It performs Language Identification and Speech Recognition together, using a joint feature encoder followed by task-specific context encoders. Evaluated on the Multilingual LibriSpeech dataset, the model reaches an average language identification accuracy of 75.98% and a Word Error Rate of 30.45% in the bilingual setting, which is comparable to the state-of-the-art Whisper model.
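For readers unfamiliar with the metric, the Word Error Rate reported above is conventionally the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, and real evaluations usually normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is plain
    whitespace splitting, with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of four reference words.
    print(word_error_rate("el perro come pan", "el perro pan"))
```

Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference, so a corpus-level figure is usually computed by summing edit counts and reference lengths over all utterances rather than averaging per-utterance rates.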