Open Access

ARTICLE

Generative Multi-Modal Mutual Enhancement Video Semantic Communications

by Yuanle Chen1, Haobo Wang1, Chunyu Liu1, Linyi Wang2, Jiaxin Liu1, Wei Wu1,*

1 College of Communication and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
2 College of Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China

* Corresponding Author: Wei Wu.

(This article belongs to the Special Issue: Machine Learning Empowered Distributed Computing: Advance in Architecture, Theory and Practice)

Computer Modeling in Engineering & Sciences 2024, 139(3), 2985-3009. https://doi.org/10.32604/cmes.2023.046837

Abstract

Recently, there have been significant advancements in the study of semantic communication in single-modal scenarios. However, the ability to process information in multi-modal environments remains limited. Inspired by research on, and applications of, natural language processing across different modalities, our goal is to accurately extract frame-level semantic information from videos and ultimately transmit high-quality video. Specifically, we propose a deep learning-based Multi-Modal Mutual Enhancement Video Semantic Communication system, called M3E-VSC. Built upon a Vector Quantized Generative Adversarial Network (VQGAN), our system leverages mutual enhancement among different modalities by using text as the main carrier of transmission. The system extracts semantic information from the key-frame images and audio of the video and applies differential processing so that the extracted text conveys accurate semantic information with fewer bits, thereby improving system capacity. Furthermore, a multi-frame semantic detection module is designed to facilitate semantic transitions during video generation. Simulation results demonstrate that the proposed model maintains high robustness in complex noise environments, particularly under low signal-to-noise ratio conditions, improving the accuracy and speed of semantic transmission in video communication by approximately 50 percent.
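To make the differential idea concrete, below is a minimal sketch in Python of how per-frame caption differences could be encoded and decoded so that only changed tokens are transmitted. The functions diff_encode and diff_decode and the token-level edit format are our own illustrative assumptions, not the paper's actual modules; in M3E-VSC the captions themselves would come from learned image/audio semantic extractors and the receiver would feed the reconstructed text into VQGAN-based generation.

```python
# Illustrative sketch (assumed, not from the paper): transmit only the tokens
# that changed between consecutive per-frame captions, so that the text carrier
# uses fewer bits when adjacent frames are semantically similar.

def diff_encode(prev_tokens: list[str], curr_tokens: list[str]) -> list[tuple[int, str]]:
    """Return (position, token) pairs where the current caption differs from the previous one."""
    edits = [(i, tok) for i, tok in enumerate(curr_tokens)
             if i >= len(prev_tokens) or prev_tokens[i] != tok]
    # An empty token at a trailing position signals that the caption got shorter.
    edits += [(i, "") for i in range(len(curr_tokens), len(prev_tokens))]
    return edits

def diff_decode(prev_tokens: list[str], edits: list[tuple[int, str]]) -> list[str]:
    """Rebuild the current caption from the previous caption plus the received edits."""
    curr = list(prev_tokens)
    for i, tok in edits:
        if tok == "":
            curr = curr[:i]        # deletion marker: truncate the caption
        elif i < len(curr):
            curr[i] = tok          # substitution at an existing position
        else:
            curr.append(tok)       # token appended beyond the old length
    return curr

if __name__ == "__main__":
    frame1 = "a man walks along a quiet street".split()
    frame2 = "a man runs along a quiet street".split()
    edits = diff_encode(frame1, frame2)
    print(edits)                   # [(2, 'runs')] -- far fewer symbols than the full caption
    assert diff_decode(frame1, edits) == frame2
```

In this toy setting, a one-word change in the scene costs a single (position, token) pair instead of a full caption, which is the capacity gain the abstract attributes to differential processing of the extracted text.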

Cite This Article

APA Style
Chen, Y., Wang, H., Liu, C., Wang, L., Liu, J. et al. (2024). Generative multi-modal mutual enhancement video semantic communications. Computer Modeling in Engineering & Sciences, 139(3), 2985-3009. https://doi.org/10.32604/cmes.2023.046837
Vancouver Style
Chen Y, Wang H, Liu C, Wang L, Liu J, Wu W. Generative multi-modal mutual enhancement video semantic communications. Comput Model Eng Sci. 2024;139(3):2985-3009. https://doi.org/10.32604/cmes.2023.046837
IEEE Style
Y. Chen, H. Wang, C. Liu, L. Wang, J. Liu, and W. Wu, “Generative Multi-Modal Mutual Enhancement Video Semantic Communications,” Comput. Model. Eng. Sci., vol. 139, no. 3, pp. 2985-3009, 2024. https://doi.org/10.32604/cmes.2023.046837



Copyright © 2024 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.