Open Access
ARTICLE
Encoder-Decoder Based Multi-Feature Fusion Model for Image Caption Generation
Mingyang Duan, Jin Liu*, Shiqi Lv
Shanghai Maritime University, Shanghai, 201306, China
* Corresponding Author: Jin Liu. Email:
Journal on Big Data 2021, 3(2), 77-83. https://doi.org/10.32604/jbd.2021.016674
Received 08 January 2021; Accepted 07 April 2021; Issue published 13 April 2021
Abstract
Image caption generation is an essential task in computer vision and
image understanding. Contemporary image caption generation models usually
use the encoder-decoder model as the underlying network structure. However, in
traditional encoder-decoder architectures, only the global features of an image
are extracted, while its local information is not well utilized. This paper
proposes an encoder-decoder model based on fused features, together with a novel
mechanism for correcting the generated caption text. In the encoder, we first use
VGG16 and Faster R-CNN to extract global and local features, respectively. Then,
in the decoder, we train a bidirectional LSTM network on the fused features.
Finally, the extracted local features are used to correct the caption text. The
experimental results demonstrate the effectiveness of the proposed method.
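As a rough illustration only, and not the authors' released code, the following PyTorch sketch shows one way fused global and local features could feed a bidirectional LSTM decoder as the abstract describes. All dimensions, class and parameter names, and the fusion-by-summation choice are assumptions; the paper does not specify them here.

```python
import torch
import torch.nn as nn

class FusionCaptioner(nn.Module):
    """Hypothetical sketch of a fused-feature encoder-decoder captioner.

    A global feature vector (e.g., a 4096-d VGG16 fc7 output) and a set of
    local region features (e.g., Faster R-CNN region vectors) are projected
    to a common space, fused, and decoded with a bidirectional LSTM.
    """

    def __init__(self, vocab_size, global_dim=4096, local_dim=2048,
                 embed_dim=512, hidden_dim=512):
        super().__init__()
        self.global_proj = nn.Linear(global_dim, embed_dim)
        self.local_proj = nn.Linear(local_dim, embed_dim)
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, global_feat, local_feats, captions):
        # Fuse: project the global vector and the mean-pooled region
        # vectors, then sum them into a single visual context token.
        g = self.global_proj(global_feat)              # (B, E)
        l = self.local_proj(local_feats).mean(dim=1)   # (B, E)
        visual = (g + l).unsqueeze(1)                  # (B, 1, E)

        # Prepend the fused visual token to the word embeddings
        # and decode the whole sequence with the BiLSTM.
        words = self.word_embed(captions)              # (B, T, E)
        seq = torch.cat([visual, words], dim=1)        # (B, T+1, E)
        hidden, _ = self.decoder(seq)
        return self.out(hidden)                        # (B, T+1, V)


# Smoke test with random tensors standing in for the VGG16 and
# Faster R-CNN outputs (batch of 2, 36 regions, 15-token captions).
model = FusionCaptioner(vocab_size=10000)
logits = model(torch.randn(2, 4096),
               torch.randn(2, 36, 2048),
               torch.randint(0, 10000, (2, 15)))
print(logits.shape)  # torch.Size([2, 16, 10000])
```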
Keywords
Cite This Article
M. Duan, J. Liu and S. Lv, "Encoder-decoder based multi-feature fusion model for image caption generation,"
Journal on Big Data, vol. 3, no. 2, pp. 77–83, 2021. https://doi.org/10.32604/jbd.2021.016674