More Diverse Training, Better Compositionality! Evidence from Multimodal Language Learning
Proceedings of the 31st International Conference on Artificial Neural Networks (ICANN 2022),
pages 417--428,
doi: 10.1007/978-3-031-15934-3_35
- Sep 2022
Artificial neural networks still fall short of human-level generalization and require a very large number of training examples to succeed. Finding model architectures that further improve generalization capabilities therefore remains an open research question. We created a multimodal dataset from simulation for measuring the compositional generalization of neural networks in multimodal language learning. The dataset consists of sequences showing a robot arm interacting with objects on a table in a simple 3D environment, with the goal of describing the interaction. Compositional object features, multiple actions, and distracting objects pose challenges to the model. We show that an LSTM-encoder-decoder architecture jointly trained with a vision-encoder surpasses previous performance and handles multiple visible objects. Visualization of important input dimensions shows that a model trained with multiple objects, but not a model trained on just one object, has learnt to ignore irrelevant objects. Furthermore, we show that additional modalities in the input improve the overall performance. We conclude that the underlying training data has a significant influence on the model's capability to generalize compositionally.
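The following is a minimal sketch of the kind of architecture the abstract describes: a vision encoder feeding per-frame features into an LSTM encoder, whose state conditions an LSTM decoder that generates the description. It is not the authors' code; all module names, layer sizes, and the vocabulary size are illustrative assumptions.

# Minimal sketch of a vision-encoder + LSTM encoder-decoder describer.
# Sizes and names are assumptions for illustration only.
import torch
import torch.nn as nn

class VisionLSTMDescriber(nn.Module):
    def __init__(self, vocab_size=100, feat_dim=128, hidden_dim=256):
        super().__init__()
        # Vision encoder: a small CNN mapping each RGB frame to a feature vector.
        self.vision = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        # Sequence encoder over the per-frame features.
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Decoder generates the description token by token.
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.decoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frames, tokens):
        # frames: (batch, time, 3, H, W); tokens: (batch, length)
        b, t = frames.shape[:2]
        feats = self.vision(frames.flatten(0, 1)).view(b, t, -1)
        _, state = self.encoder(feats)            # summarize the visual sequence
        dec_out, _ = self.decoder(self.embed(tokens), state)  # condition on it
        return self.out(dec_out)                  # per-step vocabulary logits

# Example: one 8-frame clip and a 6-token target description.
model = VisionLSTMDescriber()
logits = model(torch.rand(1, 8, 3, 64, 64), torch.randint(0, 100, (1, 6)))
print(logits.shape)  # torch.Size([1, 6, 100])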
@InProceedings{VLWW22,
  author    = {Volquardsen, Caspar and Lee, Jae Hee and Weber, Cornelius and Wermter, Stefan},
  title     = {More Diverse Training, Better Compositionality! Evidence from Multimodal Language Learning},
  booktitle = {Proceedings of the 31st International Conference on Artificial Neural Networks (ICANN 2022)},
  pages     = {417--428},
  year      = {2022},
  month     = {Sep},
  publisher = {Springer International Publishing},
  doi       = {10.1007/978-3-031-15934-3_35},
}