Open-Vocabulary Robotic Object Manipulation using Foundation Models
ESANN 2025, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning
doi: 10.5281/zenodo.14774343
- Apr 2025
Classical vision-language-action models are limited by unidirectional communication, which hinders natural human-robot interaction. The recent CrossT5 model embeds an efficient vision-action pathway into an LLM, but it lacks visual generalization, restricting actions to objects seen during training. We introduce OWL×T5, which integrates the OWLv2 object detection model into CrossT5 to enable robot actions on unseen objects. OWL×T5 is trained on a simulated dataset using the NICO humanoid robot and evaluated on the new CLAEO dataset, which features interactions with unseen objects. Results show that OWL×T5 achieves zero-shot object recognition for robotic manipulation while efficiently integrating vision-language-action capabilities.
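
The zero-shot recognition component described above can be illustrated with a minimal sketch of open-vocabulary detection using OWLv2 via the Hugging Face transformers library. This is not the paper's OWL×T5 pipeline (the routing of detector features into CrossT5 is not reproduced here); the checkpoint name, image file, and text queries are illustrative assumptions.

```python
# Minimal sketch: open-vocabulary (zero-shot) object detection with OWLv2.
# Assumes the public checkpoint "google/owlv2-base-patch16-ensemble";
# "scene.png" and the text queries are hypothetical placeholders.
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

image = Image.open("scene.png")  # e.g. a tabletop scene from the robot's camera
queries = [["a red apple", "a blue cup"]]  # free-form queries: the open vocabulary

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits and boxes into (score, label, box) detections
# in original image coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.3, target_sizes=target_sizes
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{queries[0][label]}: {score:.2f} at {box.tolist()}")
```

Because the queries are plain text, objects never seen during training can be localized simply by naming them, which is the property OWL×T5 exploits for manipulation of unseen objects.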

@InProceedings{GOWW25,
  author    = {Griebenow, Stig and Özdemir, Ozan and Weber, Cornelius and Wermter, Stefan},
  title     = {Open-Vocabulary Robotic Object Manipulation using Foundation Models},
  booktitle = {ESANN 2025, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning},
  year      = {2025},
  month     = {Apr},
  doi       = {10.5281/zenodo.14774343},
}