Open-Vocabulary Robotic Object Manipulation using Foundation Models

ESANN 2025, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, doi: 10.5281/zenodo.14774343 - Apr 2025
Classical vision-language-action models are limited by unidirectional communication, which hinders natural human-robot interaction. The recent CrossT5 embeds an efficient vision-action pathway into an LLM, but lacks visual generalization, restricting actions to objects seen during training. We introduce OWL×T5, which integrates the OWLv2 object detection model into CrossT5 to enable robot actions on unseen objects. OWL×T5 is trained on a simulated dataset using the NICO humanoid robot and evaluated on the new CLAEO dataset, which features interactions with unseen objects. Results show that OWL×T5 achieves zero-shot object recognition for robotic manipulation while efficiently integrating vision-language-action capabilities.
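
For a concrete feel for the detector side of the architecture, the snippet below is a minimal sketch (not the authors' pipeline) of zero-shot object detection with OWLv2 through the Hugging Face Transformers API; the checkpoint name, image file, text queries, and score threshold are illustrative assumptions.

import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection

# Assumed public checkpoint; OWLv2 scores free-form text queries against image regions.
processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

image = Image.open("scene.png")            # hypothetical frame from the robot's camera
queries = [["a red apple", "a blue cup"]]  # one list of open-vocabulary queries per image

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to (score, label, box) in pixel coordinates; the threshold is an assumption.
target_sizes = torch.tensor([image.size[::-1]])  # PIL size is (width, height), so reverse it
detections = processor.post_process_object_detection(
    outputs, threshold=0.3, target_sizes=target_sizes)[0]
for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
    print(f"{queries[0][label]}: {score:.2f} at {[round(v) for v in box.tolist()]}")

In the paper's architecture, the detector's outputs are integrated into the CrossT5 vision-action pathway rather than decoded to printed boxes; the snippet only illustrates OWLv2's open-vocabulary detection interface.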


@InProceedings{GOWW25,
  author    = {Griebenow, Stig and Özdemir, Ozan and Weber, Cornelius and Wermter, Stefan},
  title     = {Open-Vocabulary Robotic Object Manipulation using Foundation Models},
  booktitle = {ESANN 2025, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning},
  year      = {2025},
  month     = {Apr},
  doi       = {10.5281/zenodo.14774343},
}