End-to-End Audio Pattern Recognition Network for Overcoming Feature Limitations in Human-Machine Interaction
Zijian Sun1,2, Yaqian Li3,4,*, Haoran Liu1,2, Haibin Li3,4, Wenming Zhang3,4
1 School of Information Science and Engineering, Yanshan University, Qinhuangdao, 066004, China
2 The Key Laboratory for Special Fiber and Fiber Sensor of Hebei Province, Yanshan University, Qinhuangdao, 066004, China
3 School of Electrical Engineering, Yanshan University, Qinhuangdao, 066004, China
4 Key Laboratory of Industrial Computer Control Engineering of Hebei Province, Yanshan University, Qinhuangdao, 066004, China
* Corresponding Author: Yaqian Li. Email: yql@stumail.ysu.edu.cn
Computers, Materials & Continua https://doi.org/10.32604/cmc.2025.061920
Received 06 December 2024; Accepted 18 February 2025; Published online 02 April 2025
Abstract
In recent years, audio pattern recognition has emerged as a key area of research, driven by its applications
in human-computer interaction, robotics, and healthcare. Traditional methods, which rely heavily on handcrafted
features such as Mel filter banks, often suffer from information loss and limited feature representation capabilities. To address
these limitations, this study proposes an innovative end-to-end audio pattern recognition framework that directly processes raw audio signals, preserving original information and extracting effective classification features. The proposed
framework adopts a dual-branch architecture: a global refinement module that aggregates multi-scale channel and temporal cues to preserve fine-grained detail, and a multi-scale audio context embedding module that captures spatiotemporal dependencies and high-level semantic information. A guided fusion module then integrates the complementary features of the two branches into a comprehensive representation of the audio signal, improving classification accuracy. Experimental results demonstrate the model’s superior
performance on multiple datasets, including ESC-50, UrbanSound8K, RAVDESS, and CREMA-D, with classification
accuracies of 93.25%, 90.91%, 92.36%, and 70.50%, respectively. These results highlight the robustness and effectiveness
of the proposed framework, which significantly outperforms existing approaches. By addressing critical challenges such
as information loss and limited feature representation, this work provides new insights and methodologies for advancing
audio classification and multimodal interaction systems.
Keywords
Audio pattern recognition; raw audio; end-to-end network; feature fusion
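To make the dual-branch design described in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch. The module internals (kernel sizes, channel counts, the sigmoid-gated fusion rule, and the 16 kHz input) are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of the dual-branch framework described in the abstract.
# Layer choices and the gating fusion are assumptions for illustration only.
import torch
import torch.nn as nn

class GlobalRefinement(nn.Module):
    # Branch 1: refines channel and temporal detail from the raw waveform.
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=9, stride=4, padding=4),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
        )

    def forward(self, x):            # x: (batch, 1, samples)
        return self.conv(x)          # (batch, channels, frames)

class MultiScaleEmbedding(nn.Module):
    # Branch 2: parallel kernels of different widths capture multi-scale
    # temporal context directly from the raw waveform.
    def __init__(self, channels=64):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(1, channels // 4, kernel_size=k, stride=4, padding=k // 2)
            for k in (3, 9, 17, 33)
        )

    def forward(self, x):
        feats = [b(x) for b in self.branches]
        n = min(f.shape[-1] for f in feats)     # align frame counts
        return torch.cat([f[..., :n] for f in feats], dim=1)

class GuidedFusion(nn.Module):
    # Fusion: one branch gates the other, then a linear head classifies.
    def __init__(self, channels=64, num_classes=50):
        super().__init__()
        self.gate = nn.Conv1d(channels, channels, kernel_size=1)
        self.head = nn.Linear(channels, num_classes)

    def forward(self, g, m):
        n = min(g.shape[-1], m.shape[-1])
        fused = m[..., :n] * torch.sigmoid(self.gate(g[..., :n]))
        return self.head(fused.mean(dim=-1))    # global average pooling

# Usage: raw waveform in, class logits out (50 classes, as in ESC-50).
wave = torch.randn(2, 1, 16000)                 # 1 s of audio at 16 kHz
logits = GuidedFusion()(GlobalRefinement()(wave), MultiScaleEmbedding()(wave))
print(logits.shape)                             # torch.Size([2, 50])

The sketch keeps the key property the abstract emphasizes: both branches consume the raw waveform directly, so no handcrafted front end (e.g., a Mel filter bank) discards information before fusion.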