An Innovative Approach Utilizing Binary-View Transformer for Speech Recognition Task

Kamal, Muhammad Babar; Khan, Arfat Ahmad; Khan, Faizan Ahmed; Ali Shahid, Malik Muhammad; Wechtaisong, Chitapong; Kamal, Muhammad Daud; Ali, Muhammad Junaid; Uthansakul, Peerapong

doi:10.32604/cmc.2022.024590

Open Access

ARTICLE

by Muhammad Babar Kamal¹, Arfat Ahmad Khan², Faizan Ahmed Khan³, Malik Muhammad Ali Shahid⁴, Chitapong Wechtaisong²,*, Muhammad Daud Kamal⁵, Muhammad Junaid Ali⁶, Peerapong Uthansakul²

1 COMSATS University Islamabad, Islamabad Campus, 45550, Pakistan
2 Suranaree University of Technology, Nakhon Ratchasima, 30000, Thailand
3 COMSATS University Islamabad, Lahore Campus, 54000, Pakistan
4 COMSATS University Islamabad, Vehari Campus, 61100, Pakistan
5 National University of Sciences & Technology, Islamabad, 45550, Pakistan
6 Virtual University of Pakistan, Islamabad Campus, 45550, Pakistan

* Corresponding Author: Chitapong Wechtaisong. Email: email

Computers, Materials & Continua 2022, 72(3), 5547-5562. https://doi.org/10.32604/cmc.2022.024590

Received 23 October 2021; Accepted 26 November 2021; Issue published 21 April 2022

Abstract

Advances in deep learning have greatly improved the performance of speech recognition systems, and most recent systems are based on the Recurrent Neural Network (RNN). The RNN works well on short sequences but suffers from the vanishing gradient problem on long sequences. Transformer networks have addressed this issue and have shown state-of-the-art results on sequential and speech-related data. In speech recognition, the input audio is generally converted into an image using a Mel-spectrogram that represents frequencies and intensities. The image is then classified by a machine learning model to generate a transcript. However, the audio frequencies in the image have low resolution, which causes inaccurate predictions. This paper presents a novel end-to-end binary-view transformer-based architecture for speech recognition to cope with the frequency resolution problem. First, the input audio signal is transformed into a 2D image using a Mel-spectrogram. Second, the modified universal transformers use multi-head attention to derive contextual information and different speech-related features. A feed-forward neural network is then deployed for classification. The proposed system achieves an accuracy of 95.16% on Google's speech command dataset with minimal loss. The binary-view transformer mitigates over-fitting by deploying a multi-view mechanism to diversify the input data, and multi-head attention captures multiple contexts from the data's feature map.
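As an illustration of the multi-head attention step mentioned in the abstract, the following is a minimal sketch of standard scaled dot-product multi-head self-attention applied to the time frames of a spectrogram-like array. It is not the authors' binary-view or universal transformer architecture: the function names, the array sizes, the random projection matrices, and the random array standing in for a Mel-spectrogram are all assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def multi_head_self_attention(frames, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention over spectrogram frames.

    frames: (T, d_model) array, one row per time frame.
    w_q, w_k, w_v, w_o: (d_model, d_model) projection matrices.
    Returns the (T, d_model) output and the (num_heads, T, T) weights.
    """
    seq_len, d_model = frames.shape
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads

    def split_heads(x):
        # (T, d_model) -> (num_heads, T, d_head)
        return x.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(frames @ w_q)
    k = split_heads(frames @ w_k)
    v = split_heads(frames @ w_v)

    # Each head compares every frame with every other frame.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (H, T, T)
    weights = softmax(scores, axis=-1)
    context = weights @ v  # (H, T, d_head)

    # Concatenate the heads and apply the output projection.
    merged = context.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ w_o, weights


# Hypothetical sizes: 50 time frames, 64 features per frame, 4 heads.
T, D_MODEL, HEADS = 50, 64, 4
rng = np.random.default_rng(0)
# Random stand-in for a Mel-spectrogram (frames x Mel bands), not real audio.
spectrogram = rng.standard_normal((T, D_MODEL))
w_q, w_k, w_v, w_o = (
    rng.standard_normal((D_MODEL, D_MODEL)) * 0.1 for _ in range(4)
)

output, attn = multi_head_self_attention(
    spectrogram, w_q, w_k, w_v, w_o, HEADS
)
print(output.shape, attn.shape)  # (50, 64) (4, 50, 50)
```

Each head attends over the same frames with its own slice of the projections, which is the sense in which multi-head attention can capture several contexts from one feature map. In a trained model the projection matrices would be learned rather than sampled at random.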

Keywords

Convolution neural network; multi-head attention; multi-view; RNN; self-attention; speech recognition; transformer

Cite This Article

APA Style

Kamal, M.B., Khan, A.A., Khan, F.A., Ali Shahid, M.M., Wechtaisong, C. et al. (2022). An innovative approach utilizing binary-view transformer for speech recognition task. Computers, Materials & Continua, 72(3), 5547-5562. https://doi.org/10.32604/cmc.2022.024590

Vancouver Style

Kamal MB, Khan AA, Khan FA, Ali Shahid MM, Wechtaisong C, Kamal MD, et al. An innovative approach utilizing binary-view transformer for speech recognition task. Comput Mater Contin. 2022;72(3):5547-5562 https://doi.org/10.32604/cmc.2022.024590

IEEE Style

M. B. Kamal et al., “An Innovative Approach Utilizing Binary-View Transformer for Speech Recognition Task,” Comput. Mater. Contin., vol. 72, no. 3, pp. 5547-5562, 2022. https://doi.org/10.32604/cmc.2022.024590

Copyright © 2022 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
