Open Access
ARTICLE
Exploring Sequential Feature Selection in Deep Bi-LSTM Models for Speech Emotion Recognition
1 Computer Science Department, Future Academy-Higher Future Institute for Specialized Technological Studies, Cairo, 12622, Egypt
2 Department of Computer Science and Information, College of Science at Zulfi, Majmaah University, P. O. Box 66, Al-Majmaah, 11952, Saudi Arabia
3 Preparatory Institute for Engineering Studies of Gafsa, Zarroug, Gafsa, 2112, Tunisia
4 Computers and Systems Department, Electronics Research Institute, Cairo, 12622, Egypt
* Corresponding Author: Adel Thaljaoui. Email:
Computers, Materials & Continua 2024, 78(2), 2689-2719. https://doi.org/10.32604/cmc.2024.046623
Received 09 October 2023; Accepted 22 December 2023; Issue published 27 February 2024
Abstract
Machine Learning (ML) algorithms play a pivotal role in Speech Emotion Recognition (SER), yet they face a formidable obstacle in accurately discerning a speaker's emotional state. Analyzing speakers' emotional states is important in a range of real-time applications, including virtual reality, human-robot interaction, emergency centers, and human behavior assessment. Accurately identifying emotions in the SER process relies on extracting relevant information from audio inputs. Previous studies on SER have predominantly utilized short-time characteristics such as Mel Frequency Cepstral Coefficients (MFCCs) because they capture the periodic nature of audio signals effectively. Although such features can improve the perception and interpretation of emotional expressions, MFCCs alone have limitations. This study therefore tackles that issue by systematically selecting multiple audio cues, enhancing the classifier model's efficacy in accurately discerning human emotions. The speech data are drawn from the EMO-DB database. Input speech is preprocessed using a 2D Convolutional Neural Network (CNN) that applies convolutional operations to spectrograms, which provide a visual representation of how the frequency content of the audio signal changes over time. The spectrogram data are then normalized, a step that is crucial for Neural Network (NN) training because it aids faster convergence. Next, five auditory features, MFCCs, Chroma, Mel-Spectrogram, Contrast, and Tonnetz, are extracted from the spectrogram sequentially. The aim of feature selection is to retain only the dominant features and exclude irrelevant ones; in this paper, the Sequential Forward Selection (SFS) and Sequential Backward Selection (SBS) techniques were employed to select among the multiple audio cues. Finally, the feature sets composed from the hybrid feature extraction methods are fed into a deep Bidirectional Long Short-Term Memory (Bi-LSTM) network to discern emotions. Because a deep Bi-LSTM can hierarchically learn complex features and increases model capacity through more robust temporal modeling, it is more effective than a shallow Bi-LSTM at capturing the intricate tones of emotional content present in speech signals. The effectiveness and resilience of the proposed SER model were evaluated experimentally and compared with state-of-the-art SER techniques. The results indicate that the model achieved accuracy rates of 90.92%, 93%, and 92% on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Berlin Database of Emotional Speech (EMO-DB), and the Interactive Emotional Dyadic Motion Capture (IEMOCAP) datasets, respectively. These findings represent a marked improvement in identifying emotional expressions in speech and showcase the potential of the proposed model to advance the SER field.
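The feature-extraction step described above can be sketched roughly as follows. This is an illustrative approximation using librosa, not the authors' published code; the sampling rate, number of MFCCs, and time-averaging of each feature are assumptions made for the sketch.

```python
# Minimal sketch (assumptions noted above): extract the five auditory feature
# sets named in the abstract and concatenate their time-averaged values into
# one vector per utterance.
import numpy as np
import librosa

def extract_features(path, sr=22050, n_mfcc=40):
    y, sr = librosa.load(path, sr=sr)
    stft = np.abs(librosa.stft(y))

    mfccs    = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)                   # (40, T)
    chroma   = librosa.feature.chroma_stft(S=stft, sr=sr)                        # (12, T)
    mel      = librosa.feature.melspectrogram(y=y, sr=sr)                        # (128, T)
    contrast = librosa.feature.spectral_contrast(S=stft, sr=sr)                  # (7, T)
    tonnetz  = librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr)     # (6, T)

    # Average each feature over time and concatenate -> 193-dimensional vector.
    return np.concatenate([f.mean(axis=1)
                           for f in (mfccs, chroma, mel, contrast, tonnetz)])
```

For the selection stage, one common way to realize SFS and SBS over such vectors is scikit-learn's SequentialFeatureSelector with direction="forward" or direction="backward"; the paper's exact selection procedure and wrapper classifier may differ.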
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.