Enabling action crossmodality for a pretrained large language model

Anton Caesar, Ozan Özdemir, Cornelius Weber, Stefan Wermter

Natural Language Processing Journal, Volume 7, pages 100072, doi: 10.1016/j.nlp.2024.100072, Jun 2024 (Open Access)


Natural language processing and vision tasks have recently seen large improvements through the rise of Transformer architectures. High-performing large language models (LLMs) benefit from the large textual datasets that are abundantly available online. However, action and bidirectional action-language tasks are less developed, as they require more specific, labeled data. We therefore aim to enable these robotic action capabilities for a pretrained LLM while maintaining high efficiency with regard to the required training time and data size. To achieve this, we split a Transformer-based LLM and insert a multimodal architecture into it. Specifically, we split a pretrained T5 LLM between its encoder and decoder parts and insert the crossmodal Transformer component of a Paired Transformed Autoencoders (PTAE) bidirectional action-language model. The experiments are conducted on a new dataset consisting of unimodal language translation and crossmodal bidirectional action-language translation. The natural language capabilities of the original T5 are re-established efficiently by training the crossmodal Transformer, which requires only one 5.7 millionth of the T5 model's original training data. Furthermore, the new model, called CrossT5, achieves high accuracy on the vision- and language-guided robotic action tasks. By design, the CrossT5 agent acts robustly when tested with language commands not included in the dataset. The results demonstrate that this novel approach successfully combines the advanced linguistic capabilities of LLMs with the low-level robotic control skills of vision-action models. The code is available at this URL: https://github.com/samsoneko/CrossT5
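The split-and-insert idea described in the abstract can be illustrated with a minimal data-flow sketch. This is not the authors' implementation (see the linked repository for that); it only shows how inputs might be routed through the pretrained T5 encoder, the inserted crossmodal component, and the pretrained T5 decoder. Several things here are assumptions made for illustration: the names `Block`, `build_crosst5_sketch`, and `forward` are hypothetical, every block is a pass-through stand-in rather than a neural network, the separate action encoder and decoder are assumed, fusing the two modalities by concatenation is assumed, and freezing the two T5 halves is assumed (the abstract states only that the crossmodal Transformer is trained).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

Sequence = List[List[float]]  # a sequence of feature vectors


@dataclass
class Block:
    """One pipeline stage; `trainable` marks what training would update."""
    name: str
    trainable: bool

    def __call__(self, x: Sequence) -> Sequence:
        # Stand-in for a real network: returns a copy of its input.
        return [list(v) for v in x]


def build_crosst5_sketch() -> Dict[str, Block]:
    """Assemble the stages. Frozen/trainable flags are an assumption."""
    return {
        "language_encoder": Block("T5 encoder (pretrained)", trainable=False),
        "action_encoder": Block("action encoder (assumed)", trainable=True),
        "crossmodal": Block("crossmodal Transformer (from PTAE)", trainable=True),
        "language_decoder": Block("T5 decoder (pretrained)", trainable=False),
        "action_decoder": Block("action decoder (assumed)", trainable=True),
    }


def forward(
    blocks: Dict[str, Block],
    language: Optional[Sequence] = None,
    action: Optional[Sequence] = None,
    target: str = "language",
) -> Sequence:
    """Route language and/or action input to the requested output modality."""
    if target not in ("language", "action"):
        raise ValueError("target must be 'language' or 'action'")

    tokens: Sequence = []
    if language is not None:
        tokens += blocks["language_encoder"](language)
    if action is not None:
        tokens += blocks["action_encoder"](action)
    if not tokens:
        raise ValueError("provide language and/or action input")

    # The crossmodal component sits where the T5 encoder output
    # would normally be handed directly to the T5 decoder.
    fused = blocks["crossmodal"](tokens)
    return blocks[f"{target}_decoder"](fused)


if __name__ == "__main__":
    blocks = build_crosst5_sketch()
    # Unimodal language translation: language in, language out.
    print(forward(blocks, language=[[0.1, 0.2]], target="language"))
    # Crossmodal translation: language command in, action out.
    print(forward(blocks, language=[[0.1, 0.2]], target="action"))
```

In a real model each `Block` would be a neural module, and the pretrained halves would come from an existing T5 checkpoint. The sketch only makes the routing explicit: the same inserted component serves both the unimodal language path and the crossmodal action-language paths.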

@Article{COWW24a,
  author = {Caesar, Anton and Özdemir, Ozan and Weber, Cornelius and Wermter, Stefan},
  title = {Enabling action crossmodality for a pretrained large language model},
  booktitle = {},
  journal = {Natural Language Processing Journal},
  editors = {},
  number = {},
  volume = {7},
  pages = {100072},
  year = {2024},
  month = {Jun},
  publisher = {Elsevier B.V.},
  doi = {10.1016/j.nlp.2024.100072},
}