MVCE-Net: Multi-View Region Feature and Caption Enhancement Co-Attention Network for Visual Question Answering

Feng Yan; Wushouer Silamu; Yanbing Li

doi:10.32604/cmc.2023.038177

Open Access

ARTICLE

Feng Yan¹, Wushouer Silamu², Yanbing Li¹,*

1 School of Information Science and Engineering, Xinjiang University, Urumqi, 830046, China
2 Xinjiang Key Laboratory of Multilingual Information Technology, Urumqi, 830046, China

* Corresponding Author: Yanbing Li.

Computers, Materials & Continua 2023, 76(1), 65-80. https://doi.org/10.32604/cmc.2023.038177

Received 30 November 2022; Accepted 17 February 2023; Issue published 08 June 2023

Abstract

Visual question answering (VQA) requires a deep understanding of both an image and its corresponding textual question in order to answer accurately. However, existing models tend to focus only on the visual information in an image and ignore its implicit knowledge, which limits how deeply the image content is understood. Images contain more than just visual objects: some contain textual information about the scene, and slightly more complex images contain relationships between individual visual objects. First, this paper proposes a model that uses image descriptions for feature enhancement. The model encodes images and their descriptions separately, based on a question-guided co-attention mechanism. This mechanism enriches the model’s feature representation and enhances its reasoning ability. In addition, this paper improves the bottom-up attention model by obtaining two image region features. Once the two visual features and the spatial position information corresponding to each have been obtained, the two features are concatenated into a final image feature that better represents the image. Finally, the spatial position information is processed so that the model can perceive the size and relative position of each object in the image. Our best single model achieves 74.16% overall accuracy on the VQA 2.0 dataset, and it even outperforms some multi-modal pre-training models while using fewer images and less time.
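As a rough illustration of the last two steps described above (combining two region-feature views and giving the model access to object size and relative position), the following minimal sketch may help. It is not the authors' implementation. The function names (`box_geometry`, `relative_offset`, `build_regions`), the 7-value geometry encoding, and the choice to join the two views as one longer list of regions are all assumptions made for illustration; the abstract does not specify any of them.

```python
from typing import List, Sequence, Tuple

Box = Sequence[float]                     # (x1, y1, x2, y2) in pixels
Region = Tuple[List[float], Box]          # (feature vector, bounding box)


def box_geometry(box: Box, img_w: float, img_h: float) -> List[float]:
    """Hypothetical encoding: normalized corners, relative width/height, relative area."""
    x1, y1, x2, y2 = box
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h, w, h, w * h]


def relative_offset(a: Box, b: Box) -> List[float]:
    """Offset of b's center from a's center, scaled by a's width and height."""
    a_cx, a_cy = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    b_cx, b_cy = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return [(b_cx - a_cx) / (a[2] - a[0]), (b_cy - a_cy) / (a[3] - a[1])]


def build_regions(views: Sequence[Sequence[Region]],
                  img_w: float, img_h: float) -> List[List[float]]:
    """Assumed joining scheme: pool regions from all views into one list,
    appending each region's geometry encoding to its feature vector."""
    combined = []
    for view in views:
        for feat, box in view:
            combined.append(list(feat) + box_geometry(box, img_w, img_h))
    return combined


# Toy usage with two single-region views on a 100x200 image.
view_a = [([1.0, 2.0], (0, 0, 50, 100))]
view_b = [([3.0, 4.0], (50, 100, 100, 200))]
regions = build_regions([view_a, view_b], img_w=100, img_h=200)
```

In this sketch each output row carries the region's visual feature followed by its geometry, so a downstream attention layer could in principle use size and position directly; `relative_offset` shows one simple way to express where one object sits relative to another.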

Keywords

Bottom-up attention; spatial position relationship; region feature; self-attention

Cite This Article

APA Style

Yan, F., Silamu, W., Li, Y. (2023). MVCE-Net: Multi-View Region Feature and Caption Enhancement Co-Attention Network for Visual Question Answering. Computers, Materials & Continua, 76(1), 65–80. https://doi.org/10.32604/cmc.2023.038177

Vancouver Style

Yan F, Silamu W, Li Y. MVCE-Net: Multi-View Region Feature and Caption Enhancement Co-Attention Network for Visual Question Answering. Comput Mater Contin. 2023;76(1):65–80. https://doi.org/10.32604/cmc.2023.038177

IEEE Style

F. Yan, W. Silamu, and Y. Li, “MVCE-Net: Multi-View Region Feature and Caption Enhancement Co-Attention Network for Visual Question Answering,” Comput. Mater. Contin., vol. 76, no. 1, pp. 65–80, 2023. https://doi.org/10.32604/cmc.2023.038177

Copyright © 2023 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
