Open Access

ARTICLE


Lip-Audio Modality Fusion for Deep Forgery Video Detection

Yong Liu1,4, Zhiyu Wang2,*, Shouling Ji3, Daofu Gong1,5, Lanxin Cheng1, Ruosi Cheng1

1 College of Cyberspace Security, Information Engineering University, Zhengzhou, 450001, China
2 Research Institute of Intelligent Networks, Zhejiang Lab, Hangzhou, 311121, China
3 College of Computer Science and Technology, Zhejiang University, Hangzhou, 310027, China
4 Henan Key Laboratory of Cyberspace Situation Awareness, Zhengzhou, 450001, China
5 Key Laboratory of Cyberspace Security, Ministry of Education, Zhengzhou, 450001, China

* Corresponding Author: Zhiyu Wang. Email: email

(This article belongs to the Special Issue: Multimedia Security in Deep Learning)

Computers, Materials & Continua 2025, 82(2), 3499-3515. https://doi.org/10.32604/cmc.2024.057859

Abstract

Traditional deep forgery detection methods largely ignore tampering in the audio modality. This study therefore explores an effective deep forgery video detection technique that improves detection precision and reliability by fusing lip images with audio signals. The core method is lip-audio matching detection based on a Siamese neural network, combined with band-pass-filtered MFCC (Mel Frequency Cepstral Coefficient) feature extraction, an improved dual-branch Siamese network structure, and a two-stream network design. First, the video stream is preprocessed to extract lip images, and the audio stream is preprocessed to extract MFCC features. These features are then processed separately by the two branches of the Siamese network. Finally, the model is trained and optimized through fully connected layers and loss functions. Experimental results show that the model reaches a testing accuracy of 92.3% on the LRW (Lip Reading in the Wild) dataset, with a recall of 94.3% and an F1 score of 93.3%, significantly outperforming CNN (Convolutional Neural Network) and LSTM (Long Short-Term Memory) baselines. In the multi-resolution image-stream validation, the dual-resolution image stream achieves the highest accuracy of 94%. Band-pass filters effectively improve the signal-to-noise ratio of deep forgery video detection when processing different types of audio signals. The model also delivers excellent real-time processing performance and achieves an average score of up to 5 in user research. These results demonstrate that the proposed method effectively fuses visual and audio information in deep forgery video detection, accurately identifies inconsistencies between video and audio, and verifies the effectiveness of lip-audio modality fusion in improving detection performance.
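The audio branch described above can be illustrated with a minimal NumPy sketch of band-pass-limited MFCC extraction: the mel filterbank is restricted to a speech band before cepstral analysis, which is one simple way to realize the band-pass filtering the abstract describes. This is not the authors' implementation — the frame length, hop size, filter count, and the 300–3400 Hz pass-band are illustrative assumptions.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr, f_lo, f_hi):
    """Triangular mel filterbank restricted to the band [f_lo, f_hi] Hz.

    Limiting the filterbank to this band acts as a band-pass filter:
    spectral energy outside the speech-relevant range is discarded
    before cepstral analysis.
    """
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Filter edge frequencies, evenly spaced on the mel scale
    mel_pts = np.linspace(mel(f_lo), mel(f_hi), n_filters + 2)
    bins = np.floor((n_fft + 1) * inv_mel(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising slope
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling slope
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_ceps=13, f_lo=300.0, f_hi=3400.0):
    """Band-pass MFCCs: frame -> window -> power spectrum ->
    band-limited mel filterbank -> log -> DCT-II."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    fbank = mel_filterbank(n_filters, n_fft, sr, f_lo, f_hi)
    log_energies = np.log(power @ fbank.T + 1e-10)
    # DCT-II over the filterbank axis; keep the first n_ceps coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1)
                 / (2 * n_filters))
    return log_energies @ dct.T  # shape: (n_frames, n_ceps)

# One second of 16 kHz audio yields 98 frames of 13 coefficients,
# a sequence the audio branch of the Siamese network could consume.
audio = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
features = mfcc(audio)
```

In the full system, these per-frame coefficient vectors would feed one branch of the dual-branch Siamese network while the lip-image features feed the other, and the fully connected layers and loss function score their consistency.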

Keywords

Deep forgery video detection; lip-audio modality fusion; Mel frequency cepstral coefficient; Siamese neural network; band-pass filter

Cite This Article

APA Style
Liu, Y., Wang, Z., Ji, S., Gong, D., Cheng, L. et al. (2025). Lip-Audio Modality Fusion for Deep Forgery Video Detection. Computers, Materials & Continua, 82(2), 3499–3515. https://doi.org/10.32604/cmc.2024.057859
Vancouver Style
Liu Y, Wang Z, Ji S, Gong D, Cheng L, Cheng R. Lip-Audio Modality Fusion for Deep Forgery Video Detection. Comput Mater Contin. 2025;82(2):3499–3515. https://doi.org/10.32604/cmc.2024.057859
IEEE Style
Y. Liu, Z. Wang, S. Ji, D. Gong, L. Cheng, and R. Cheng, “Lip-Audio Modality Fusion for Deep Forgery Video Detection,” Comput. Mater. Contin., vol. 82, no. 2, pp. 3499–3515, 2025. https://doi.org/10.32604/cmc.2024.057859



Copyright © 2025 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.