Peizhu Gong1, Jin Liu1, Zhongdai Wu2, Bing Han2, Y. Ken Wang3, Huihua He4,*
CMC-Computers, Materials & Continua, Vol.74, No.2, pp. 4203-4220, 2023, DOI:10.32604/cmc.2023.028291
- 31 October 2022
Abstract Speech emotion recognition, as an important component of human-computer interaction technology, has received increasing attention. Recent studies have treated emotion recognition of speech signals as a multimodal task, because speech carries the semantic features of two different modalities, i.e., audio and text. However, existing methods often fail to represent features effectively and to capture correlations. This paper presents a multi-level circulant cross-modal Transformer (MLCCT) for multimodal speech emotion recognition. The proposed model can be divided into three steps: feature extraction, interaction, and fusion. Self-supervised embedding models are introduced for feature extraction, which give a…