Open Access
ARTICLE
Stacked Attention Networks for Referring Expressions Comprehension
Yugang Li1, *, Haibo Sun1, Zhe Chen1, Yudan Ding1, Siqi Zhou2
1 Academy of Broadcasting Science, Beijing, 100866, China.
2 School of Electrical and Electronic Engineering, Nanyang Technological University, 639798, Singapore.
* Corresponding Author: Yugang Li.
Computers, Materials & Continua 2020, 65(3), 2529-2541. https://doi.org/10.32604/cmc.2020.011886
Received 03 June 2020; Accepted 03 July 2020; Issue published 16 September 2020
Abstract
Referring expressions comprehension is the task of locating the image region
described by a natural language expression, which refers to properties of the region or
to its relationships with other regions. Most previous work handles this problem by
selecting the most relevant region from a set of candidate regions, which is inefficient
when the candidate set is large. Inspired by the recent success of deep learning methods
in image captioning, we propose a framework that understands referring expressions
through multiple steps of reasoning. Our model selects the most relevant region
directly from the image. Its core is a recurrent attention network, which can be seen
as an extension of the Memory Network and which can refine its results over multiple
computational hops. We evaluate the proposed model on two referring expression
datasets: Visual Genome and Flickr30k Entities. The experimental results demonstrate
that the proposed model outperforms previous state-of-the-art methods in both accuracy
and efficiency. We also conduct an ablation experiment, which shows that the
performance of the model does not improve simply by increasing the number of
attention layers.
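
The idea of refining an attention result over several computational hops can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the dot-product scoring, the additive query update, and the toy feature vectors are all assumptions made for illustration, in the general style of a Memory Network read-and-update step.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_hop(query, region_feats):
    """One hop (assumed form): attend over regions, then refine the query."""
    # Score each image region against the current query vector.
    weights = softmax([dot(query, r) for r in region_feats])
    # Attention-weighted sum of region features (the "read" from memory).
    context = [
        sum(w * r[i] for w, r in zip(weights, region_feats))
        for i in range(len(query))
    ]
    # Assumed update rule: add the read vector to the query.
    new_query = [q + c for q, c in zip(query, context)]
    return new_query, weights

def locate(query, region_feats, hops=2):
    """Run several hops and return the index of the most attended region."""
    weights = None
    for _ in range(hops):
        query, weights = attention_hop(query, region_feats)
    best = max(range(len(weights)), key=weights.__getitem__)
    return best, weights

# Hypothetical toy inputs: three 2-d region features and an expression embedding.
regions = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
expression = [1.0, 0.2]
best, weights = locate(expression, regions, hops=2)
```

In this sketch each hop re-scores the regions with a query that already incorporates what was attended to previously, which is the sense in which extra hops can sharpen the selection. The `hops` parameter stands in for the number of stacked attention layers.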
Cite This Article
Y. Li, H. Sun, Z. Chen, Y. Ding and S. Zhou, "Stacked attention networks for referring expressions comprehension,"
Computers, Materials & Continua, vol. 65, no. 3, pp. 2529–2541, 2020.