Ziwei Tang1, Yaohua Yi2,*, Changhui Yu2, Aiguo Yin3
CMC-Computers, Materials & Continua, Vol.75, No.3, pp. 6007-6022, 2023, DOI:10.32604/cmc.2023.037861
29 April 2023
Abstract Existing image captioning models usually generate captions by relating visual information to words, but they do not make use of spatial information or object classes. To address this issue, we propose a novel Position-Class Awareness Transformer (PCAT) network, which serves as a bridge between visual features and captions by embedding spatial information and awareness of object classes. We construct the PCAT network by proposing a novel Grid Mapping Position Encoding (GMPE) method and refining the encoder-decoder framework. First, GMPE includes mapping the regions of objects to grids, calculating the relative distance among…