Relations Also Need Attention: Integrating Relation Information Into Image Captioning

Tianyu Chen (Guangxi Normal University); Zhixin Li (Guangxi Normal University)*; Tiantao Xian (Guangxi Normal University); Canlong Zhang (Guangxi Normal University); Huifang Ma (Northwest Normal University)


Image captioning methods with attention mechanisms are leading this field, especially models with global and local attention. However, the relationship information between the various regions and objects of an image is also very instructive for caption generation. For example, if there is a football in the image, there is a high probability that the image also contains people near the football. In this article, this kind of relationship feature is embedded into the global-local attention mechanism to explore the internal visual and semantic relationships between different object regions. In addition, to alleviate the exposure bias problem and make the training process more efficient, we propose a new method to apply Generative Adversarial Networks to sequence generation. The greedy decoding method is used to generate an efficient baseline reward for self-critical training. Finally, experiments on the MSCOCO dataset show that the model can generate more accurate and vivid captions and outperforms many recent advanced models on various prevailing evaluation metrics.
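The self-critical training signal mentioned above can be illustrated with a minimal sketch. This is not the paper's implementation; the function names and the example reward values are hypothetical, assuming a sentence-level scorer such as CIDEr provides the rewards for the sampled and greedy-decoded captions:

```python
def self_critical_advantage(sampled_reward, greedy_reward):
    # The greedy-decoded caption's reward serves as the baseline:
    # advantage = reward(sampled caption) - reward(greedy caption).
    return sampled_reward - greedy_reward

def self_critical_loss(token_log_probs, sampled_reward, greedy_reward):
    # REINFORCE-style objective: scale the negative log-likelihood of the
    # sampled caption by the advantage (treated as a constant).  A positive
    # advantage reinforces the sampled caption; a negative one suppresses it.
    advantage = self_critical_advantage(sampled_reward, greedy_reward)
    return -advantage * sum(token_log_probs)

# Hypothetical example: log-probabilities of the sampled caption's tokens,
# with CIDEr-like rewards 1.1 (sampled) and 0.9 (greedy baseline).
loss = self_critical_loss([-0.2, -0.5, -0.1], sampled_reward=1.1, greedy_reward=0.9)
```

Because the baseline is produced by the model's own greedy decoding, captions that score above what the model would deterministically generate are reinforced, without training a separate value network.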