Open Access iconOpen Access

ARTICLE

crossmark

MVCE-Net: Multi-View Region Feature and Caption Enhancement Co-Attention Network for Visual Question Answering

Feng Yan1, Wushouer Silamu2, Yanbing Li1,*

1 School of Information Science and Engineering, Xinjiang University, Urumqi, 830046, China
2 Xinjiang Key Laboratory of Multilingual Information Technology, Urumqi, 830046, China

* Corresponding Author: Yanbing Li. Email: email

Computers, Materials & Continua 2023, 76(1), 65-80. https://doi.org/10.32604/cmc.2023.038177

Abstract

Visual question answering (VQA) requires a deep understanding of images and their corresponding textual questions to answer questions about images more accurately. However, existing models tend to ignore the implicit knowledge in the images and focus only on the visual information in the images, which limits the understanding depth of the image content. The images contain more than just visual objects, some images contain textual information about the scene, and slightly more complex images contain relationships between individual visual objects. Firstly, this paper proposes a model using image description for feature enhancement. This model encodes images and their descriptions separately based on the question-guided co-attention mechanism. This mechanism increases the feature representation of the model, enhancing the model’s ability for reasoning. In addition, this paper improves the bottom-up attention model by obtaining two image region features. After obtaining the two visual features and the spatial position information corresponding to each feature, concatenating the two features as the final image feature can better represent an image. Finally, the obtained spatial position information is processed to enable the model to perceive the size and relative position of each object in the image. Our best single model delivers a 74.16% overall accuracy on the VQA 2.0 dataset, our model even outperforms some multi-modal pre-training models with fewer images and a shorter time.

Keywords


Cite This Article

F. Yan, W. Silamu and Y. Li, "Mvce-net: multi-view region feature and caption enhancement co-attention network for visual question answering," Computers, Materials & Continua, vol. 76, no.1, pp. 65–80, 2023. https://doi.org/10.32604/cmc.2023.038177



cc This work is licensed under a Creative Commons Attribution 4.0 International License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
  • 802

    View

  • 449

    Download

  • 0

    Like

Share Link