Comparing Apples to Oranges: LLM-Powered Multimodal Intention Prediction in an Object Categorization Task

Social Robotics, pages 292–306, doi: 10.1007/978-981-96-3525-2_25, Mar 2025 (Open Access)
Human intention-based systems enable robots to perceive and interpret user actions in order to interact with humans and proactively adapt to their behavior. Intention prediction is therefore pivotal for natural interaction with social robots in human-designed environments. In this paper, we examine the use of Large Language Models (LLMs) to infer human intention in a collaborative object categorization task with a physical robot. We propose a novel multimodal approach that integrates non-verbal user cues, such as hand gestures, body poses, and facial expressions, with environment states and verbal user cues to predict user intentions within a hierarchical architecture. Our evaluation of five LLMs shows their potential for reasoning about verbal and non-verbal user cues, leveraging their context understanding and real-world knowledge to support intention prediction while collaborating on a task with a social robot. Video: https://youtu.be/tBJHfAuzohI
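
For illustration, the minimal Python sketch below shows how verbal cues, non-verbal cues, and the environment state could be fused into a single prompt for an LLM-based intention query. The data structures, cue labels, prompt wording, and the placeholder query_llm function are assumptions made for this example and do not reproduce the authors' implementation.

# Minimal sketch (illustrative, not the paper's implementation): fusing
# multimodal user cues with the environment state into one LLM prompt
# for intention prediction in an object categorization task.

from dataclasses import dataclass, field
from typing import List


@dataclass
class UserCues:
    """Perceived verbal and non-verbal user cues (assumed upstream perception)."""
    utterance: str          # e.g. output of speech recognition
    hand_gesture: str       # e.g. "pointing at the apple"
    body_pose: str          # e.g. "leaning towards the fruit box"
    facial_expression: str  # e.g. "smiling"


@dataclass
class EnvironmentState:
    """Objects and categories currently present in the shared workspace (assumed)."""
    objects: List[str] = field(default_factory=list)
    categories: List[str] = field(default_factory=list)


def build_prompt(cues: UserCues, env: EnvironmentState) -> str:
    """Combine verbal cues, non-verbal cues, and environment state into one prompt."""
    return (
        "You are assisting a robot in a collaborative object categorization task.\n"
        f"Objects on the table: {', '.join(env.objects)}\n"
        f"Available categories: {', '.join(env.categories)}\n"
        f"User said: \"{cues.utterance}\"\n"
        f"Hand gesture: {cues.hand_gesture}\n"
        f"Body pose: {cues.body_pose}\n"
        f"Facial expression: {cues.facial_expression}\n"
        "Infer the user's current intention: which object they want handled "
        "and into which category it should go. Answer in one short sentence."
    )


def query_llm(prompt: str) -> str:
    """Placeholder for a call to any LLM backend (remote API or local model)."""
    # A real system would send `prompt` to the chosen LLM here.
    return "The user intends to place the apple into the 'fruit' category."


if __name__ == "__main__":
    cues = UserCues(
        utterance="Could you put this one over there?",
        hand_gesture="pointing at the apple",
        body_pose="leaning towards the fruit box",
        facial_expression="neutral",
    )
    env = EnvironmentState(
        objects=["apple", "orange", "screwdriver"],
        categories=["fruit", "tool"],
    )
    print(query_llm(build_prompt(cues, env)))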

 

@InProceedings{AAW25a,
  author    = {Ali, Hassan and Allgeuer, Philipp and Wermter, Stefan},
  title     = {Comparing Apples to Oranges: {LLM}-Powered Multimodal Intention Prediction in an Object Categorization Task},
  booktitle = {Social Robotics},
  pages     = {292--306},
  year      = {2025},
  month     = {Mar},
  doi       = {10.1007/978-981-96-3525-2_25},
}