Xiliang Zhang1, Jin Liu1,*, Yue Li1, Zhongdai Wu2,3, Y. Ken Wang4
CMC-Computers, Materials & Continua, Vol.73, No.3, pp. 6407-6424, 2022, DOI:10.32604/cmc.2022.027097
Published: 28 July 2022
Abstract: The performance of Video Question and Answer (VQA) systems relies on capturing key information from both visual images and natural-language context to generate relevant answers to questions. However, traditional linear combinations of multimodal features capture only shallow feature interactions and fall far short of the deep feature fusion that is needed. Attention mechanisms have been used to perform deeper fusion, but most of them can only assign weights within single-modal information, leading to attention imbalance across different modalities. To address the above problems, we propose a novel VQA model based on Triple Multimodal feature Cyclic Fusion …
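To make the contrast in the abstract concrete, below is a minimal sketch (not the paper's TMCF model) of the two fusion styles it describes: a shallow linear combination of pooled modality features versus a cross-modal attention fusion, where the attention weights depend on the interaction between the question and image regions. All module names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearFusion(nn.Module):
    """Shallow fusion: concatenate pooled features, apply one linear map.

    Captures no cross-modal interaction beyond a weighted sum of the
    two feature vectors.
    """

    def __init__(self, vis_dim: int, txt_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(vis_dim + txt_dim, out_dim)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # vis: (batch, vis_dim) pooled image feature
        # txt: (batch, txt_dim) pooled question feature
        return self.proj(torch.cat([vis, txt], dim=-1))


class CrossModalAttentionFusion(nn.Module):
    """Deeper fusion: the question attends over image regions, so the
    weights are a function of both modalities rather than one alone."""

    def __init__(self, vis_dim: int, txt_dim: int, out_dim: int):
        super().__init__()
        self.q = nn.Linear(txt_dim, out_dim)  # query from the question
        self.k = nn.Linear(vis_dim, out_dim)  # keys from image regions
        self.v = nn.Linear(vis_dim, out_dim)  # values from image regions
        self.scale = out_dim ** -0.5

    def forward(self, regions: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # regions: (batch, num_regions, vis_dim); txt: (batch, txt_dim)
        q = self.q(txt).unsqueeze(1)                       # (batch, 1, out_dim)
        k, v = self.k(regions), self.v(regions)
        attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return (attn @ v).squeeze(1)                       # attended visual feature


if __name__ == "__main__":
    # Hypothetical shapes: 36 region features of size 2048, a 768-d question.
    regions, txt = torch.randn(2, 36, 2048), torch.randn(2, 768)
    shallow = LinearFusion(2048, 768, 512)(regions.mean(dim=1), txt)
    deep = CrossModalAttentionFusion(2048, 768, 512)(regions, txt)
    print(shallow.shape, deep.shape)  # torch.Size([2, 512]) for both
```

Note that even this attention variant weights only the visual regions by a text-derived query; balancing attention across all modalities simultaneously is the imbalance the abstract points to.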