1D-CNN: Speech Emotion Recognition System Using a Stacked Network with Dilated CNN Features

Mustaqeem; Soonil Kwon

doi:10.32604/cmc.2021.015070

Interaction Technology Laboratory, Department of Software, Sejong University, Seoul, 05006, Korea

* Corresponding Author: Soonil Kwon.

(This article belongs to the Special Issue: Deep Learning Trends in Intelligent Systems)

Computers, Materials & Continua 2021, 67(3), 4039-4059. https://doi.org/10.32604/cmc.2021.015070

Received 05 November 2020; Accepted 12 January 2021; Issue published 01 March 2021

Abstract

Emotion recognition from speech is an active and emerging research area that plays an important role in numerous applications, such as robotics, virtual reality, behavior assessment, and emergency call centers. Researchers have recently developed many deep learning techniques to improve accuracy in this field, but the recognition rate is still not convincing. Our main aim is to develop a new technique that increases the recognition rate at a reasonable computational cost. In this paper, we propose a one-dimensional dilated convolutional neural network (1D-DCNN) for speech emotion recognition (SER) that uses hierarchical feature learning blocks (HFLBs) with a bi-directional gated recurrent unit (BiGRU). We designed a one-dimensional CNN to enhance the speech signals using spectral analysis and to extract their hidden patterns, which are fed into a stacked one-dimensional dilated network whose blocks are called HFLBs. Each HFLB contains one dilated convolution layer (DCL), one batch normalization (BN) layer, and one leaky ReLU (leaky_relu) layer in order to extract the emotional features using a hierarchical correlation strategy. The learned emotional features are then fed into a BiGRU to adjust the global weights and to recognize the temporal cues. The final state of the deep BiGRU is passed to a softmax classifier to produce the emotion probabilities. The proposed model was evaluated on three benchmark datasets, IEMOCAP, EMO-DB, and RAVDESS, on which it achieved 72.75%, 91.14%, and 78.01% accuracy, respectively.
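As a rough illustration of the building block named in the abstract, the sketch below stacks three blocks of dilated 1D convolution, normalization, and leaky ReLU in plain Python. The kernel weights, the dilation rates (1, 2, 4), the toy input, and all function names are assumptions made for this example and are not taken from the paper. The normalization has no learned scale or shift, and the BiGRU and softmax stages are omitted.

```python
import math

def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1D dilated convolution (cross-correlation form, no padding)."""
    span = (len(kernel) - 1) * dilation  # samples covered beyond the first tap
    return [
        sum(w * signal[t + k * dilation] for k, w in enumerate(kernel))
        for t in range(len(signal) - span)
    ]

def batch_norm(values, eps=1e-5):
    """Normalize to zero mean and unit variance (assumption: no learned parameters)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def leaky_relu(values, negative_slope=0.01):
    """Pass positive values through and scale negative values by a small slope."""
    return [v if v > 0 else negative_slope * v for v in values]

def hflb(signal, kernel, dilation):
    """One illustrative block: dilated convolution, then normalization, then leaky ReLU."""
    return leaky_relu(batch_norm(dilated_conv1d(signal, kernel, dilation)))

def receptive_field(kernel_size, dilations):
    """Number of input samples that influence one output sample of the stack."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Toy input and hypothetical hyperparameters (not from the paper).
signal = [float(i % 5) for i in range(32)]
kernel = [0.25, 0.5, 0.25]
dilations = (1, 2, 4)

features = signal
for d in dilations:
    features = hflb(features, kernel, d)

print(len(features), receptive_field(len(kernel), dilations))  # prints "18 15"
```

With these assumed settings, each output sample depends on 15 input samples even though every kernel has only three taps, which is the property that lets a stack of dilated layers capture longer-range context at low cost.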

Keywords

Affective computing; one-dimensional dilated convolutional neural network; emotion recognition; gated recurrent unit; raw audio clips

Cite This Article

APA Style

Mustaqeem, & Kwon, S. (2021). 1D-CNN: Speech Emotion Recognition System Using a Stacked Network with Dilated CNN Features. Computers, Materials & Continua, 67(3), 4039–4059. https://doi.org/10.32604/cmc.2021.015070

Vancouver Style

Mustaqeem, Kwon S. 1D-CNN: Speech Emotion Recognition System Using a Stacked Network with Dilated CNN Features. Comput Mater Contin. 2021;67(3):4039–4059. https://doi.org/10.32604/cmc.2021.015070

IEEE Style

Mustaqeem and S. Kwon, “1D-CNN: Speech Emotion Recognition System Using a Stacked Network with Dilated CNN Features,” Comput. Mater. Contin., vol. 67, no. 3, pp. 4039–4059, 2021. https://doi.org/10.32604/cmc.2021.015070

Copyright © 2021 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.