Generalization in Multimodal Language Learning from Simulation
Proceedings of the International Joint Conference on Neural Networks (IJCNN 2021), Jul 2021
Neural networks can be powerful function approximators, which are able to model high-dimensional feature
distributions from a subset of examples drawn from the target
distribution. Naturally, they perform well at generalizing within
the limits of their target function, but they often fail to generalize
outside of the explicitly learned feature space. It is therefore
an open research topic whether and how neural network-based
architectures can be deployed for systematic reasoning. Many
studies have shown evidence for poor generalization, but they
often work with abstract data or are limited to single-channel
input. Humans, however, learn and interact through a combination of multiple sensory modalities, and rarely rely on just
one. To investigate compositional generalization in a multimodal
setting, we generate an extensible dataset with multimodal
input sequences from simulation. We investigate the influence
of the underlying training data distribution on compositional
generalization in a minimal LSTM-based network trained in
a supervised, time-continuous setting. We find that compositional
generalization fails in simple setups but improves with the
number of objects and actions, and in particular with greater
color overlap between objects. Furthermore, multimodality strongly
improves compositional generalization in settings where a pure
vision model struggles to generalize.
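The abstract describes a minimal LSTM-based network trained on multimodal (vision plus language) input sequences. As a rough illustration of such a setup, the sketch below fuses per-frame visual features with embedded language tokens before a single LSTM; the class name, layer sizes, and fusion by concatenation are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class MultimodalLSTM(nn.Module):
    """Minimal sketch (assumed, not the paper's exact model): at each time
    step, a visual feature vector and an embedded language token are
    concatenated and fed into a single LSTM, whose final hidden state is
    mapped to an output label. All dimensions are illustrative."""

    def __init__(self, vision_dim=128, vocab_size=50, embed_dim=32,
                 hidden_dim=256, num_classes=20):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)      # language tokens
        self.lstm = nn.LSTM(vision_dim + embed_dim, hidden_dim,
                            batch_first=True)                 # fused sequence
        self.head = nn.Linear(hidden_dim, num_classes)        # output labels

    def forward(self, vision_seq, token_seq):
        # vision_seq: (batch, time, vision_dim) visual features per frame
        # token_seq:  (batch, time) integer token ids aligned to the frames
        fused = torch.cat([vision_seq, self.embed(token_seq)], dim=-1)
        _, (h_n, _) = self.lstm(fused)
        return self.head(h_n[-1])                             # per-sequence logits

# Example forward pass with random tensors (shapes only, no real dataset).
model = MultimodalLSTM()
vision = torch.randn(4, 10, 128)           # 4 sequences of 10 frames
tokens = torch.randint(0, 50, (4, 10))     # matching language token ids
logits = model(vision, tokens)             # -> (4, num_classes)
```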
@InProceedings{ELWW21,
  author    = {Eisermann, Aaron and Lee, Jae Hee and Weber, Cornelius and Wermter, Stefan},
  title     = {Generalization in Multimodal Language Learning from Simulation},
  booktitle = {Proceedings of the International Joint Conference on Neural Networks (IJCNN 2021)},
  year      = {2021},
  month     = {Jul},
}