Vocal Sentiments: Transformer Based Speech Emotion Recognition
DOI: https://doi.org/10.21015/vtse.v13i3.2174

Abstract
Speech Emotion Recognition (SER) plays a crucial role in Human–Computer Interaction (HCI) by enabling systems to interpret and respond to human emotions through speech analysis. This paper presents a Transformer-based SER framework that leverages the Wav2Vec2 model for self-supervised representation learning. Unlike conventional approaches that rely on handcrafted acoustic features or shallow learning models, our method employs transfer learning to extract high-level contextual embeddings from raw audio. We combine two benchmark datasets, RAVDESS and TESS, to improve generalization across diverse speakers and emotions, and we further analyze system robustness by introducing varying levels of environmental noise. The proposed model achieves an accuracy of 79.01% with balanced precision, recall, and F1-scores, demonstrating competitive performance against recent state-of-the-art SER models. The main contributions of this work are threefold: (i) a novel evaluation of Wav2Vec2 embeddings on combined RAVDESS–TESS data, (ii) a systematic assessment of noise robustness in Transformer-based SER, and (iii) a comprehensive benchmark highlighting the strengths and limitations of transfer learning in practical emotion recognition scenarios. These findings suggest broad applicability to voice assistants, call-center analytics, and mental health monitoring, while future extensions may incorporate multimodal data and advanced fine-tuning strategies to further enhance performance.
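To make the described pipeline concrete, the sketch below shows one plausible realization of the approach summarized above: environmental noise is mixed into raw audio at a target SNR, a frozen Wav2Vec2 encoder produces an utterance-level embedding, and a small feed-forward head predicts the emotion class. The checkpoint name, mean-pooling strategy, head dimensions, and eight-class label set are illustrative assumptions, not the authors' exact configuration.

```python
# Illustrative sketch of the described pipeline; NOT the authors' released code.
# Assumptions: checkpoint name, mean pooling, head shape, 8 emotion classes.
import torch
import torch.nn as nn
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL_NAME = "facebook/wav2vec2-base-960h"  # assumed pre-trained checkpoint
extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_NAME)
encoder = Wav2Vec2Model.from_pretrained(MODEL_NAME).eval()

def add_noise(speech: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Mix noise into speech at a target signal-to-noise ratio in dB."""
    reps = speech.numel() // noise.numel() + 1        # tile noise to cover the clip
    noise = noise.repeat(reps)[: speech.numel()]
    p_speech = speech.pow(2).mean()
    p_noise = noise.pow(2).mean()
    scale = torch.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

@torch.no_grad()
def utterance_embedding(waveform: torch.Tensor, sr: int = 16_000) -> torch.Tensor:
    """Encode 16 kHz mono audio and mean-pool frame embeddings into one vector."""
    inputs = extractor(waveform.numpy(), sampling_rate=sr, return_tensors="pt")
    frames = encoder(inputs.input_values).last_hidden_state  # (1, T, 768)
    return frames.mean(dim=1).squeeze(0)                     # (768,)

# Lightweight classification head over the frozen embeddings (shape assumed).
classifier = nn.Sequential(
    nn.Linear(768, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 8),  # e.g., 8 emotion classes for combined RAVDESS + TESS
)

# Example: probe noise robustness by sweeping SNR levels on one utterance.
speech = torch.randn(16_000 * 3)  # stand-in for a 3 s RAVDESS/TESS clip
noise = torch.randn(16_000)       # stand-in for an environmental-noise recording
for snr in (20.0, 10.0, 0.0):
    emb = utterance_embedding(add_noise(speech, noise, snr))
    logits = classifier(emb)
    print(f"SNR {snr:>4} dB -> predicted class {logits.argmax().item()}")
```

In this reading of the abstract, evaluating the same classifier across several SNR levels is what quantifies the noise-robustness claim; a real run would replace the random tensors with RAVDESS/TESS clips resampled to 16 kHz mono and train the head on the pooled embeddings.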