Mingyang Duan, Jin Liu*, Shiqi Lv
Journal on Big Data, Vol. 3, No. 2, pp. 77-83, 2021, DOI: 10.32604/jbd.2021.016674
13 April 2021
Abstract: Image caption generation is an essential task in computer vision and image understanding. Contemporary image caption generation models usually use the encoder-decoder model as the underlying network structure. However, traditional encoder-decoder architectures extract only the global features of the images, while the local information of the images is not well utilized. This paper proposes an encoder-decoder model based on fused features and a novel mechanism for correcting the generated caption text. In the encoder, we first use VGG16 and Faster R-CNN to extract the global and local features. Then, we train the bidirectional …
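The abstract describes an encoder that fuses global VGG16 features with local, region-level features from Faster R-CNN. Below is a minimal sketch of such a fused-feature encoder using torchvision backbones; the layer choices, the number of regions, fusion by concatenation, and the `FusedFeatureEncoder` class itself are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16
from torchvision.models.detection import fasterrcnn_resnet50_fpn


class FusedFeatureEncoder(torch.nn.Module):
    """Encodes an image into one vector that concatenates a global VGG16
    feature with an averaged feature over detected object regions."""

    def __init__(self, num_regions: int = 5):
        super().__init__()
        backbone = vgg16(weights="IMAGENET1K_V1")
        self.cnn = backbone.features    # VGG16 convolutional feature extractor
        self.pool = backbone.avgpool    # adaptive pooling to 7x7
        # First fully connected layer (+ ReLU) of VGG16 -> 4096-d vector
        self.fc = torch.nn.Sequential(*list(backbone.classifier.children())[:2])
        self.detector = fasterrcnn_resnet50_fpn(weights="DEFAULT")
        self.num_regions = num_regions

    def _vgg_vector(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pool(self.cnn(x))
        return self.fc(torch.flatten(x, 1))                 # (N, 4096)

    @torch.no_grad()
    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (3, H, W) tensor with values in [0, 1]
        # (ImageNet normalization is omitted here for brevity)
        self.detector.eval()

        # Global feature of the whole image.
        g = self._vgg_vector(image.unsqueeze(0))            # (1, 4096)

        # Local features: re-encode the top-k detected boxes with VGG16
        # (a simple stand-in for the detector's region features).
        boxes = self.detector([image])[0]["boxes"][: self.num_regions]
        region_feats = []
        for x1, y1, x2, y2 in boxes.round().long().tolist():
            crop = image[:, y1:y2, x1:x2]
            if crop.numel() == 0:
                continue
            crop = F.interpolate(crop.unsqueeze(0), size=(224, 224),
                                 mode="bilinear")
            region_feats.append(self._vgg_vector(crop))
        l = (torch.cat(region_feats).mean(0, keepdim=True)
             if region_feats else torch.zeros_like(g))      # (1, 4096)

        # Fuse global and local information by concatenation; a caption
        # decoder would condition on this 8192-d visual context vector.
        return torch.cat([g, l], dim=1)                     # (1, 8192)
```

A usage sketch under the same assumptions: `ctx = FusedFeatureEncoder()(torch.rand(3, 224, 224))` yields a single fused context vector that a recurrent decoder could attend to or take as its initial input.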