Image-to-Text Transduction with Spatial Self-Attention

Proceedings of the 26th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), pages 43--48, Apr 2018
Attention mechanisms have been shown to improve recurrent encoder-decoder architectures in sequence-to-sequence learning scenarios. Recently, the Transformer model was proposed, which relies solely on dot-product attention and omits recurrent operations to obtain a source-target mapping [5]. In this paper we show that the concepts of self- and inter-attention can effectively be applied in an image-to-text task. The encoder applies pre-trained convolution and pooling operations followed by self-attention to obtain an image feature representation. Self-attention combines image features of regions based on their similarity before they are made accessible to the decoder through inter-attention.
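The scaled dot-product attention that the abstract refers to can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the 7x7 grid of 512-dimensional CNN features is a hypothetical example of the spatial regions the encoder attends over.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Scaled dot-product attention as in the Transformer [5].

    q, k, v: arrays of shape (n_regions, d), one query/key/value per
    spatial image region. Returns the attention-weighted combination
    of the region values.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)  # pairwise region similarities
    # softmax over regions (numerically stabilized)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Self-attention: queries, keys, and values all come from the same
# feature map -- here a hypothetical 7x7 grid of 512-d CNN features,
# flattened to 49 regions.
rng = np.random.default_rng(0)
features = rng.standard_normal((49, 512))
attended = scaled_dot_product_attention(features, features, features)
print(attended.shape)  # (49, 512)
```

In inter-attention the queries would instead come from the decoder states while keys and values come from these attended image features.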

 

@InProceedings{SLWW18, 
 	 author =  {Springenberg, Sebastian and Lakomkin, Egor and Weber, Cornelius and Wermter, Stefan},  
 	 title = {Image-to-Text Transduction with Spatial Self-Attention}, 
 	 booktitle = {Proceedings of the 26th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN)},
 	 pages = {43--48},
 	 year = {2018},
 	 month = {Apr},
 	 publisher = {i6doc},
 }