A DeepSpeech2-Inspired Convolutional Recurrent Framework for Low-Resource Urdu Speech Recognition
DOI: https://doi.org/10.21015/vtse.v14i2.2392

Abstract
Automatic speech recognition for low-resource languages is difficult because of small annotated corpora, wide speaker and phonetic diversity, and the lack of strong end-to-end baselines. Urdu is an especially significant case: it has a very large speaker population, yet no high-performing open automatic speech recognition system has been available to date. This study presents an end-to-end Urdu speech-to-text model built on a DeepSpeech2-inspired convolutional recurrent neural network that integrates spectrogram-based acoustic modeling, bidirectional gated recurrent units, and Connectionist Temporal Classification (CTC) to learn alignment-free transcription. The model was trained on the Urdu subset of the Mozilla Common Voice corpus (58,119 training utterances and 6,458 validation utterances) and evaluated on a held-out test set. It converged consistently during training, reaching a validation Word Error Rate (WER) of 21.29% with a loss of 5.87 at epoch 478, and a final test WER of 17.05%, Sentence Error Rate of 34.72%, and Word Information Loss of 0.41. Under the same evaluation setting, the proposed model outperformed a reduced recurrent baseline, a transformer-based baseline, and a wav2vec2-style baseline, whose WERs were 23.84%, 19.62%, and 18.31%, respectively. Ablation analysis further indicated that both convolutional feature extraction and deep bidirectional temporal modeling are essential to performance, and error analysis identified phonetic confusion, dialectal variation, background noise, and fast speech as the most prevalent sources of recognition error. These results demonstrate that a well-tuned convolutional recurrent model offers a competitive solution for Urdu automatic speech recognition under low-resource conditions and a reproducible reference point for future studies.
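The architecture summarized above (a convolutional front end over spectrograms, deep bidirectional GRUs, and a CTC output layer) can be sketched compactly. The following PyTorch sketch is illustrative only: the class name UrduDeepSpeech2, the 80-bin log-mel input, the Deep Speech 2-style kernel sizes, and the 48-character vocabulary are assumptions for the example, not the paper's reported configuration.

```python
import torch
import torch.nn as nn

class UrduDeepSpeech2(nn.Module):
    """Hypothetical DeepSpeech2-style acoustic model: a 2-D convolutional
    front end over log-mel spectrograms, stacked bidirectional GRUs, and a
    linear projection to per-frame character logits for CTC training."""

    def __init__(self, n_mels=80, n_chars=48, rnn_hidden=512, rnn_layers=5):
        super().__init__()
        # Two conv layers shrink the frequency axis 4x and the time axis 2x,
        # with the clipped-ReLU activation used in Deep Speech 2.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2), padding=(20, 5)),
            nn.BatchNorm2d(32),
            nn.Hardtanh(0, 20, inplace=True),
            nn.Conv2d(32, 32, kernel_size=(21, 11), stride=(2, 1), padding=(10, 5)),
            nn.BatchNorm2d(32),
            nn.Hardtanh(0, 20, inplace=True),
        )
        self.rnn = nn.GRU(32 * (n_mels // 4), rnn_hidden, num_layers=rnn_layers,
                          bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * rnn_hidden, n_chars + 1)  # +1 for the CTC blank

    def forward(self, spec):
        # spec: (batch, time, n_mels) log-mel features
        x = self.conv(spec.unsqueeze(1).transpose(2, 3))  # (B, 32, n_mels//4, T')
        x = x.permute(0, 3, 1, 2).flatten(2)              # (B, T', 32 * n_mels//4)
        x, _ = self.rnn(x)                                # (B, T', 2 * rnn_hidden)
        return self.fc(x)                                 # (B, T', n_chars + 1)

# One alignment-free training step with CTC loss on dummy tensors.
model = UrduDeepSpeech2()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
spec = torch.randn(4, 300, 80)                      # 4 utterances, 300 frames
logits = model(spec)
log_probs = logits.log_softmax(-1).transpose(0, 1)  # CTCLoss expects (T', B, C)
targets = torch.randint(1, 49, (4, 40))             # dummy character indices
in_lens = torch.full((4,), log_probs.size(0), dtype=torch.long)
tgt_lens = torch.full((4,), 40, dtype=torch.long)
ctc(log_probs, targets, in_lens, tgt_lens).backward()
```

For evaluation, WER and Word Information Loss can be computed with a library such as jiwer (jiwer.wer, jiwer.wil), while Sentence Error Rate is simply the fraction of utterances whose decoded transcript differs from the reference.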