Image Captioning with X-GELU Activated XGL Transformer
2022; RELX Group (Netherlands); Language: English
DOI: 10.2139/ssrn.4127610
ISSN: 1556-5068
Authors: Dhruv Sharma, Chhavi Dhiman, Dinesh Kumar
Topic(s): Video Analysis and Summarization
Abstract: Image captioning aims to extract multiple semantic features from an image and integrate them into a sentence-level description. Generating accurate captions requires learning higher-order interactions between detected objects and the relationships among them, yet most existing systems model only first-order interactions and ignore the higher-order ones. In this paper, an efficient higher-order interaction learning framework is proposed within an encoder-decoder based image captioning architecture. An efficient XGL Transformer (XGL-T) is defined that integrates four XGL attention modules in the image encoder and one in the sentence decoder. The XGL attention module uses low-rank bilinear pooling, the x-GELU activation function, and a Skip Squeeze-and-Excitation (SSE) network to leverage higher-order interactions among multiple objects by exploiting spatial and channel-wise attention. The proposed model is validated through extensive experiments on the publicly available MSCOCO dataset, which demonstrate the efficacy of the XGL-T model: it significantly outperforms other state-of-the-art methods, as measured by the BLEU, METEOR, CIDEr, ROUGE-L, and SPICE evaluation metrics. An ablation study is also carried out to support the experimental results.
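To make the abstract's description of the XGL attention module concrete, the following is a minimal PyTorch sketch of low-rank bilinear pooling attention with spatial and channel-wise branches. It is illustrative only, not the authors' implementation: the record does not give the exact form of the x-GELU activation or of the Skip Squeeze-and-Excitation (SSE) network, so standard GELU and a plain squeeze-and-excitation gate are used as stand-ins, and all names here (LowRankBilinearAttention, d_low, etc.) are hypothetical.

    # Illustrative sketch of a low-rank bilinear pooling attention block
    # in the spirit of the XGL attention module described in the abstract.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LowRankBilinearAttention(nn.Module):
        def __init__(self, d_model: int, d_low: int):
            super().__init__()
            # Low-rank projections of query, key, and value
            self.q_proj = nn.Linear(d_model, d_low)
            self.k_proj = nn.Linear(d_model, d_low)
            self.v_proj = nn.Linear(d_model, d_low)
            # Spatial attention head: one logit per detected region
            self.spatial = nn.Linear(d_low, 1)
            # Squeeze-and-excitation style channel attention (stand-in
            # for the SSE network, whose details are not given here)
            self.se = nn.Sequential(
                nn.Linear(d_low, d_low // 4),
                nn.ReLU(inplace=True),
                nn.Linear(d_low // 4, d_low),
                nn.Sigmoid(),
            )
            self.out = nn.Linear(d_low, d_model)

        def forward(self, query, keys, values):
            # query: (B, d_model); keys, values: (B, N, d_model) object features.
            # F.gelu stands in for the paper's x-GELU variant.
            q = F.gelu(self.q_proj(query)).unsqueeze(1)   # (B, 1, d_low)
            k = F.gelu(self.k_proj(keys))                 # (B, N, d_low)
            v = self.v_proj(values)                       # (B, N, d_low)
            # Second-order (bilinear) query-key interaction via the
            # elementwise product of low-rank projections
            b_qk = q * k                                  # (B, N, d_low)
            # Spatial attention: weight and pool the N regions
            alpha = F.softmax(self.spatial(b_qk), dim=1)  # (B, N, 1)
            pooled = (alpha * v).sum(dim=1)               # (B, d_low)
            # Channel attention: squeeze over regions, then excite channels
            beta = self.se(b_qk.mean(dim=1))              # (B, d_low)
            return self.out(beta * pooled)                # (B, d_model)

    # Example: attend over 36 detected-object features of dimension 512.
    attn = LowRankBilinearAttention(d_model=512, d_low=256)
    feats = torch.randn(2, 36, 512)
    q = feats.mean(dim=1)   # a simple query; the paper forms queries inside the Transformer
    out = attn(q, feats, feats)   # (2, 512)

The elementwise product of the projected query and keys is what lifts the module beyond the first-order (purely linear) interactions that standard dot-product attention captures; stacking such modules in the encoder, as the abstract describes, compounds this into higher-order interactions.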