Open Access

ARTICLE

Using Speaker-Specific Emotion Representations in Wav2vec 2.0-Based Modules for Speech Emotion Recognition

by Somin Park1, Mpabulungi Mark1, Bogyung Park2, Hyunki Hong1,*

1 College of Software, Chung-Ang University, Seoul, 06973, Korea
2 Department of AI, Chung-Ang University, Seoul, 06973, Korea

* Corresponding Author: Hyunki Hong.

Computers, Materials & Continua 2023, 77(1), 1009-1030. https://doi.org/10.32604/cmc.2023.041332

Abstract

Speech emotion recognition (SER) is essential for frictionless human-machine interaction, in which machines respond to human instructions with context-aware actions. The properties of an individual’s voice vary with culture, language, gender, and personality, and these speaker-specific variations can hamper the performance of standard representations in downstream tasks such as SER. This study demonstrates the significance of speaker-specific speech characteristics and shows how accounting for them improves the performance of SER models. In the proposed approach, two wav2vec 2.0-based modules (a speaker-identification network and an emotion classification network) are trained with the Arcface loss. The speaker-identification network uses a single attention block to encode an input audio waveform into a speaker-specific representation. The emotion classification network uses a wav2vec 2.0 backbone and four attention blocks to encode the same input audio waveform into an emotion representation. These two representations are then fused into a single vector representation that contains both emotion and speaker-specific information. Experimental results show that using speaker-specific characteristics improves SER performance. Additionally, combining them with an angular margin loss such as the Arcface loss improves intra-class compactness while increasing inter-class separability, as demonstrated by plots of t-distributed stochastic neighbor embeddings (t-SNE). The proposed approach outperforms previous methods that use similar training strategies, achieving a weighted accuracy (WA) of 72.14% and an unweighted accuracy (UA) of 72.97% on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. This demonstrates its effectiveness and its potential to enhance human-machine interaction through more accurate emotion recognition in speech.
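
The fusion and loss described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The abstract does not say how the two representations are fused, so concatenation is assumed here. The function names (`fuse`, `arcface_logits`, `arcface_loss`) and the scale and margin values (30.0 and 0.5) are hypothetical choices made for illustration only. The sketch shows the core idea of an additive angular margin: the angle between the embedding and the weight vector of the true class is enlarged by a margin before the softmax, which penalizes embeddings that are not tightly clustered around their class direction.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(emotion_vec, speaker_vec):
    """Fuse the two representations into one vector.

    Assumption: simple concatenation. The abstract only states that the
    representations are fused, not how.
    """
    return list(emotion_vec) + list(speaker_vec)


def arcface_logits(embedding, class_weights, target, scale=30.0, margin=0.5):
    """Return scaled cosine logits with an angular margin on the target class.

    scale and margin are illustrative values, not taken from the paper.
    Simplification: no special handling when theta + margin exceeds pi.
    """
    e = l2_normalize(embedding)
    logits = []
    for j, w in enumerate(class_weights):
        wn = l2_normalize(w)
        cos = sum(a * b for a, b in zip(e, wn))
        cos = max(-1.0, min(1.0, cos))  # guard acos against rounding error
        if j == target:
            # Add the margin to the angle of the true class only.
            cos = math.cos(math.acos(cos) + margin)
        logits.append(scale * cos)
    return logits


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for a single example."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def arcface_loss(embedding, class_weights, target, scale=30.0, margin=0.5):
    """Cross-entropy computed on the margin-adjusted logits."""
    logits = arcface_logits(embedding, class_weights, target, scale, margin)
    return cross_entropy(logits, target)


# Toy usage: 2-dimensional emotion and speaker vectors, three emotion classes.
fused = fuse([0.9, 0.1], [0.2, 0.4])
weights = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
loss_with_margin = arcface_loss(fused, weights, target=0)
loss_without_margin = arcface_loss(fused, weights, target=0, margin=0.0)
```

For the same embedding, the loss with the margin is larger than the loss without it. The embedding therefore has to move closer to its class direction to reach the same loss, which is consistent with the intra-class compactness the abstract reports.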

Copyright © 2023 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.