Image Captioning with X-GELU Activated XGL Transformer

2022; RELX Group (Netherlands); Language: English

10.2139/ssrn.4127610

ISSN

1556-5068

Authors

Dhruv Sharma, Chhavi Dhiman, Dinesh Kumar

Topic(s)

Video Analysis and Summarization

Abstract

Image captioning aims to extract multiple semantic features from an image and integrate them into a sentence-level description. Generating accurate captions requires learning higher-order interactions among detected objects and the relationships between them, yet most existing systems model only first-order interactions and ignore the higher-order ones. This paper proposes an efficient higher-order interaction learning framework built on an encoder-decoder image captioning architecture. The proposed XGL Transformer (XGL-T) integrates four XGL attention modules in the image encoder and one in the sentence decoder. Each XGL attention module combines low-rank bilinear pooling, the x-GELU activation function, and a Skip-Squeeze-and-Excitation (SSE) network to capture higher-order interactions among multiple objects through spatial and channel-wise attention. Extensive experiments on the publicly available MSCOCO dataset show that the proposed XGL-T model significantly outperforms other state-of-the-art methods, as measured by the BLEU, METEOR, CIDEr, ROUGE-L, and SPICE evaluation metrics. An ablation study further supports the experimental results.
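The abstract names three ingredients of the XGL attention module: a GELU-family activation, a squeeze-and-excitation channel gate, and low-rank bilinear pooling. The sketch below illustrates each in NumPy. It is a minimal illustration, not the paper's implementation: the exact form of x-GELU is not given in this abstract, so the standard tanh-approximated GELU stands in for it, and the weight shapes and function names here are assumptions.

```python
import numpy as np

def gelu(x):
    # Standard GELU (tanh approximation). The paper's x-GELU variant is not
    # specified in the abstract, so plain GELU is used as a stand-in.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def squeeze_excite(features, w1, w2):
    # Channel-wise squeeze-and-excitation: global-average-pool each channel,
    # pass the pooled vector through a small bottleneck, and rescale channels.
    # features: (channels, positions); w1: (rank, channels); w2: (channels, rank).
    squeezed = features.mean(axis=1)                 # (channels,) squeeze step
    hidden = gelu(w1 @ squeezed)                     # bottleneck (excitation MLP)
    scale = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))     # sigmoid gate per channel
    return features * scale[:, None]                 # reweighted feature map

def low_rank_bilinear(x, y, u, v, p):
    # Low-rank bilinear pooling: project both inputs into a shared rank-d
    # space, fuse with an elementwise product, then project to the output.
    # x: (dx,), y: (dy,); u: (d, dx), v: (d, dy), p: (out, d).
    return p @ (np.tanh(u @ x) * np.tanh(v @ y))
```

In a transformer-style encoder, `squeeze_excite` would act as the channel-attention branch over region features, while `low_rank_bilinear` fuses pairs of feature vectors to expose second-order (and, when stacked, higher-order) interactions at a far lower parameter cost than a full bilinear map.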
