Vocal Sentiments: Transformer Based Speech Emotion Recognition
DOI: https://doi.org/10.21015/vtse.v13i3.2174

Abstract
Speech Emotion Recognition (SER) plays a crucial role in Human–Computer Interaction (HCI) by enabling systems to interpret and respond to human emotions through speech analysis. This paper presents a Transformer-based SER framework that leverages the Wav2Vec2 model for self-supervised representation learning. Unlike conventional approaches that rely on handcrafted acoustic features or shallow learning models, our method employs transfer learning to extract high-level contextual embeddings from raw audio. We combine two benchmark datasets, RAVDESS and TESS, to improve generalization across diverse speakers and emotions, and we further analyze system robustness by introducing varying levels of environmental noise. The proposed model achieves an accuracy of 79.01% with balanced precision, recall, and F1-scores, demonstrating competitive performance against recent state-of-the-art SER models. The main contributions of this work are threefold: (i) a novel evaluation of Wav2Vec2 embeddings on combined RAVDESS–TESS data, (ii) a systematic assessment of noise robustness in Transformer-based SER, and (iii) a comprehensive benchmark highlighting the strengths and limitations of transfer learning in practical emotion recognition scenarios. These findings suggest broad applicability to voice assistants, call-center analytics, and mental health monitoring, while future extensions may incorporate multimodal data and advanced fine-tuning strategies to further enhance performance.
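To make the described pipeline concrete, the sketch below shows one plausible realization of the approach summarized above: environmental noise is mixed into raw audio at a target SNR, a frozen Wav2Vec2 encoder produces an utterance-level embedding, and a small feed-forward head predicts the emotion class. The checkpoint name, mean-pooling strategy, head dimensions, and eight-class label set are illustrative assumptions, not the authors' exact configuration.

```python
# Illustrative sketch of the described pipeline; NOT the authors' released code.
# Assumptions: checkpoint name, mean pooling, head shape, 8 emotion classes.
import torch
import torch.nn as nn
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL_NAME = "facebook/wav2vec2-base-960h"  # assumed pre-trained checkpoint
extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_NAME)
encoder = Wav2Vec2Model.from_pretrained(MODEL_NAME).eval()

def add_noise(speech: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Mix noise into speech at a target signal-to-noise ratio in dB."""
    reps = speech.numel() // noise.numel() + 1        # tile noise to cover the clip
    noise = noise.repeat(reps)[: speech.numel()]
    p_speech = speech.pow(2).mean()
    p_noise = noise.pow(2).mean()
    scale = torch.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

@torch.no_grad()
def utterance_embedding(waveform: torch.Tensor, sr: int = 16_000) -> torch.Tensor:
    """Encode 16 kHz mono audio and mean-pool frame embeddings into one vector."""
    inputs = extractor(waveform.numpy(), sampling_rate=sr, return_tensors="pt")
    frames = encoder(inputs.input_values).last_hidden_state  # (1, T, 768)
    return frames.mean(dim=1).squeeze(0)                     # (768,)

# Lightweight classification head over the frozen embeddings (shape assumed).
classifier = nn.Sequential(
    nn.Linear(768, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 8),  # e.g., 8 emotion classes for combined RAVDESS + TESS
)

# Example: probe noise robustness by sweeping SNR levels on one utterance.
speech = torch.randn(16_000 * 3)  # stand-in for a 3 s RAVDESS/TESS clip
noise = torch.randn(16_000)       # stand-in for an environmental-noise recording
for snr in (20.0, 10.0, 0.0):
    emb = utterance_embedding(add_noise(speech, noise, snr))
    logits = classifier(emb)
    print(f"SNR {snr:>4} dB -> predicted class {logits.argmax().item()}")
```

In this reading of the abstract, evaluating the same classifier across several SNR levels is what quantifies the noise-robustness claim; a real run would replace the random tensors with RAVDESS/TESS clips resampled to 16 kHz mono and train the head on the pooled embeddings.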