Learning Visually Grounded Human-Robot Dialog in a Hybrid Neural Architecture

Xiaowen Sun, Cornelius Weber, Matthias Kerzel, Tom Weber, Mengdi Li, Stefan Wermter

Artificial Neural Networks and Machine Learning – ICANN 2022, pages 258--269, doi: 10.1007/978-3-031-15931-2_22 - Sep 2022 Open Access

Conducting a dialog in human-robot interaction (HRI) involves complexities that are hard to reconcile by individual research or engineering works. Towards the development of a robotic dialog agent, we develop a verbal and visual instruction scenario in which a robot needs to enter into a dialog to resolve ambiguities. We propose a novel hybrid neural architecture to learn the robotic part of the interaction. A neural dialog state tracker learns to process the user input depending on visual inputs and dialog instances. It uses variables to allow certain generality to generate the robot's physical or verbal actions. We train it on a new visual dialog dataset, test different forms of input representations, and validate the robot agent on unseen examples. We evaluate our hybrid neural network approach in handling an HRI conversation scenario that is extendable to a real robot. Furthermore, we demonstrate that the hybrid approach allows generalization to a large range of unseen visual inputs and verbal instructions.

@InProceedings{SWKWLW22, 
 	 author =  {Sun, Xiaowen and Weber, Cornelius and Kerzel, Matthias and Weber, Tom and Li, Mengdi and Wermter, Stefan},  
 	 title = {Learning Visually Grounded Human-Robot Dialog in a Hybrid Neural Architecture}, 
 	 booktitle = {Artificial Neural Networks and Machine Learning -- ICANN 2022},
 	 pages = {258--269},
 	 year = {2022},
 	 month = sep,
 	 publisher = {Springer International Publishing},
 	 doi = {10.1007/978-3-031-15931-2_22}, 
 }