Image-to-Text Transduction with Spatial Self-Attention
Proceedings of the 26th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN),
pages 43--48,
Apr 2018
Attention mechanisms have been shown to improve recurrent
encoder-decoder architectures in sequence-to-sequence learning scenarios.
Recently, the Transformer model was proposed, which applies only
dot-product attention and omits recurrent operations to obtain a source-target mapping [5]. In this paper we show that the concepts of self- and
inter-attention can effectively be applied to an image-to-text task. The
encoder applies pre-trained convolution and pooling operations followed
by self-attention to obtain an image feature representation. Self-attention
combines image features of regions based on their similarity before they
are made accessible to the decoder through inter-attention.
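As a rough illustration of the encoder described above, the sketch below applies scaled dot-product self-attention to a flattened grid of pre-trained convolutional region features, so that each region is recombined with similar regions before being exposed to the decoder. All function names, shapes, and projection sizes are illustrative assumptions, not the authors' implementation.

# Minimal sketch of spatial self-attention over CNN region features.
# Shapes and projection sizes are illustrative assumptions.
import torch
import torch.nn.functional as F

def self_attention(regions, w_q, w_k, w_v):
    """regions: (n_regions, d_feat) CNN features, one row per spatial cell."""
    q = regions @ w_q                        # queries (n_regions, d_k)
    k = regions @ w_k                        # keys    (n_regions, d_k)
    v = regions @ w_v                        # values  (n_regions, d_k)
    scores = q @ k.T / k.shape[-1] ** 0.5    # pairwise region similarity
    weights = F.softmax(scores, dim=-1)      # each row sums to 1
    return weights @ v                       # similarity-weighted mix of regions

# Example: a 7x7 feature map with 512 channels, flattened to 49 regions.
d_feat, d_k = 512, 64
regions = torch.randn(49, d_feat)
w_q, w_k, w_v = (torch.randn(d_feat, d_k) for _ in range(3))
encoded = self_attention(regions, w_q, w_k, w_v)  # (49, 64)

The resulting per-region representations would then serve as the keys and values that the decoder attends to via inter-attention, in the style of the Transformer's encoder-decoder attention.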
@InProceedings{SLWW18,
  author    = {Springenberg, Sebastian and Lakomkin, Egor and Weber, Cornelius and Wermter, Stefan},
  title     = {Image-to-Text Transduction with Spatial Self-Attention},
  booktitle = {Proceedings of the 26th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN)},
  pages     = {43--48},
  year      = {2018},
  month     = {Apr},
  publisher = {i6doc},
}