An Efficient Text-Independent Speaker Identification Using Feature Fusion and Transformer Model

Arfat Khan; Rashid Jahangir; Roobaea Alroobaea; Saleh Alyahyan; Ahmed Almulhi; Majed Alsafyani; Chitapong Wechtaisong

doi:10.32604/cmc.2023.036797

Open Access

ARTICLE

Arfat Ahmad Khan¹, Rashid Jahangir²,*, Roobaea Alroobaea³, Saleh Yahya Alyahyan⁴, Ahmed H. Almulhi³, Majed Alsafyani³, Chitapong Wechtaisong⁵

1 College of Computing, Khon Kaen University, Khon Kaen, 40000, Thailand
2 Department of Computer Science, COMSATS University Islamabad, Vehari Campus, Vehari, 61100, Pakistan
3 Department of Computer Science, College of Computers and Information Technology, Taif University, P.O. Box 11099, Taif, 21944, Saudi Arabia
4 Department of Computer Science, Community College in Dawadmi, Shaqra University, Dawadmi, 17472, Saudi Arabia
5 School of Telecommunication Engineering, Suranaree University of Technology, Nakhon Ratchasima, 30000, Thailand

* Corresponding Author: Rashid Jahangir. Email: email

Computers, Materials & Continua 2023, 75(2), 4085-4100. https://doi.org/10.32604/cmc.2023.036797

Received 12 October 2022; Accepted 30 January 2023; Issue published 31 March 2023

Abstract

Automatic Speaker Identification (ASI) is the process of distinguishing among the speakers whose utterances make up an audio stream. Several factors, such as framework differences, overlapping sound events, and the presence of multiple sound sources during recording, make the ASI task considerably more complex. This research proposes a deep learning model to improve the accuracy of ASI systems and reduce model training time under limited computational resources. Specifically, the performance of the transformer model is investigated. Seven audio features (chromagram, Mel-spectrogram, tonnetz, Mel-Frequency Cepstral Coefficients (MFCCs), delta MFCCs, delta-delta MFCCs, and spectral contrast) are extracted from the ELSDSR, CSTR-VCTK, and Ar-DAD datasets. Across the experiments, the proposed transformer model achieved its best performance on all datasets when using all seven audio features. The highest accuracies attained on ELSDSR, CSTR-VCTK, and Ar-DAD are 0.99, 0.97, and 0.99, respectively. These results indicate that the proposed technique performs strongly on ASI problems.
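The abstract does not say how the seven features are combined, so the following is only a minimal sketch of one common form of feature fusion: each feature matrix is averaged over time and the resulting vectors are concatenated into a single input vector. The function names `summarize` and `fuse`, the time-averaging step, and the toy values are all assumptions made for illustration, not the authors' pipeline.

```python
from statistics import fmean


def summarize(feature):
    """Average a (coefficients x frames) matrix over time: one value per coefficient."""
    return [fmean(row) for row in feature]


def fuse(features):
    """Concatenate the time-averaged vectors of several feature matrices.

    Assumption: fusion by concatenation; the paper may combine features differently.
    """
    fused = []
    # Iterate in sorted name order so every utterance yields the same vector layout.
    for name in sorted(features):
        fused.extend(summarize(features[name]))
    return fused


# Hypothetical toy stand-ins for two of the seven features
# (3 chroma bins and 2 MFCC coefficients, 4 frames each).
toy = {
    "mfcc": [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 2.0, 2.0]],
    "chroma": [[0.5, 0.5, 0.5, 0.5], [1.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 4.0]],
}
fused_vector = fuse(toy)
print(len(fused_vector))  # 3 chroma values followed by 2 MFCC values
```

In practice the per-feature matrices would come from an audio analysis library rather than hand-written lists; the fused vector length is simply the sum of the coefficient counts of the individual features.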

Keywords

Speaker identification; signal processing; Arabic; deep learning; transformer

Cite This Article

APA Style

Khan, A.A., Jahangir, R., Alroobaea, R., Alyahyan, S.Y., Almulhi, A.H. et al. (2023). An Efficient Text-Independent Speaker Identification Using Feature Fusion and Transformer Model. Computers, Materials & Continua, 75(2), 4085–4100. https://doi.org/10.32604/cmc.2023.036797

Vancouver Style

Khan AA, Jahangir R, Alroobaea R, Alyahyan SY, Almulhi AH, Alsafyani M, et al. An Efficient Text-Independent Speaker Identification Using Feature Fusion and Transformer Model. Comput Mater Contin. 2023;75(2):4085–4100. https://doi.org/10.32604/cmc.2023.036797

IEEE Style

A. A. Khan et al., “An Efficient Text-Independent Speaker Identification Using Feature Fusion and Transformer Model,” Comput. Mater. Contin., vol. 75, no. 2, pp. 4085–4100, 2023. https://doi.org/10.32604/cmc.2023.036797

Copyright © 2023 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
