Open Access

ARTICLE

The Efficacy of Deep Learning-Based Mixed Model for Speech Emotion Recognition

Mohammad Amaz Uddin1, Mohammad Salah Uddin Chowdury1, Mayeen Uddin Khandaker2,*, Nissren Tamam3, Abdelmoneim Sulieman4
1 Department of Computer Science and Engineering, BGC Trust University Bangladesh, Chittagong, 4381, Bangladesh
2 Centre for Applied Physics and Radiation Technologies, School of Engineering and Technology, Sunway University, Bandar Sunway, Selangor, 47500, Malaysia
3 Department of Physics, College of Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh, 11671, Saudi Arabia
4 Department of Radiology and Medical Imaging, Prince Sattam bin Abdulaziz University, Alkharj, Saudi Arabia
* Corresponding Author: Mayeen Uddin Khandaker. Email:

Computers, Materials & Continua 2023, 74(1), 1709-1722. https://doi.org/10.32604/cmc.2023.031177

Received 12 April 2022; Accepted 23 May 2022; Issue published 22 September 2022

Abstract

Human speech indirectly conveys the speaker's mental state or emotion. Artificial Intelligence (AI)-based techniques can exploit this by recognizing emotion from speech. In this study, we introduce a robust method for emotion recognition from human speech that combines an effective preprocessing technique with a mixed deep learning model consisting of a Long Short-Term Memory (LSTM) network and a Convolutional Neural Network (CNN). About 2800 audio files were extracted from the Toronto emotional speech set (TESS) database for this study. A high-pass filter and a Savitzky-Golay filter were applied to obtain noise-free, smooth audio data. Seven emotion classes were considered: Angry, Disgust, Fear, Happy, Neutral, Pleasant-surprise, and Sad. Energy, fundamental frequency, and Mel Frequency Cepstral Coefficients (MFCC) were used as emotion features, and these features yielded 97.5% accuracy with the mixed LSTM + CNN model. The mixed model is found to outperform the usual state-of-the-art models in emotion recognition from speech, indicating that it could be effectively utilized in advanced research on sound processing.
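For readers who wish to see the pipeline outlined above in concrete form, the following is a minimal sketch in Python, assuming SciPy for the high-pass and Savitzky-Golay filtering, librosa for MFCC extraction, and Keras/TensorFlow for the mixed CNN + LSTM classifier. The filter cutoff, window length, polynomial order, MFCC count, and layer sizes are all illustrative assumptions; the abstract does not specify these parameters, and the energy and fundamental-frequency features the study combines with the MFCCs are omitted for brevity.

```python
import librosa
import numpy as np
from scipy.signal import butter, sosfiltfilt, savgol_filter
from tensorflow import keras
from tensorflow.keras import layers

def preprocess(audio: np.ndarray, sr: int) -> np.ndarray:
    """High-pass filtering followed by Savitzky-Golay smoothing."""
    # 4th-order Butterworth high-pass; the 100 Hz cutoff is an assumed value.
    sos = butter(4, 100, btype="highpass", fs=sr, output="sos")
    filtered = sosfiltfilt(sos, audio)
    # Savitzky-Golay smoothing; window length and polynomial order are assumed.
    return savgol_filter(filtered, window_length=11, polyorder=3)

def mfcc_features(path: str, n_mfcc: int = 40) -> np.ndarray:
    """Load one audio clip and return MFCCs with shape (frames, n_mfcc)."""
    y, sr = librosa.load(path, sr=None)
    return librosa.feature.mfcc(y=preprocess(y, sr), sr=sr, n_mfcc=n_mfcc).T

def build_model(frames: int, n_mfcc: int, n_classes: int = 7) -> keras.Model:
    """Mixed CNN + LSTM classifier over the seven emotion classes."""
    return keras.Sequential([
        layers.Input(shape=(frames, n_mfcc)),
        # Conv1D extracts local spectral patterns from the MFCC frames.
        layers.Conv1D(64, kernel_size=5, activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        # The LSTM models the temporal evolution of those patterns.
        layers.LSTM(128),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])

model = build_model(frames=128, n_mfcc=40)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```

In practice, clips of varying length would be padded or truncated to a fixed number of frames (128 here, an assumed value) before batching, and the seven emotion labels one-hot encoded for the categorical cross-entropy loss.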

Keywords

Emotion recognition; Savitzky-Golay; fundamental frequency; MFCC; neural networks

Cite This Article

M. A. Uddin, M. S. U. Chowdury, M. U. Khandaker, N. Tamam and A. Sulieman, "The efficacy of deep learning-based mixed model for speech emotion recognition," Computers, Materials & Continua, vol. 74, no. 1, pp. 1709–1722, 2023.



This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.