Emergence of Multimodal Action Representations from Neural Network Self-Organization
Cognitive Systems Research, Volume 43, Aug 2016
doi: 10.1016/j.cogsys.2016.08.002
The integration of multisensory information plays a crucial role in autonomous robotics for forming robust and meaningful representations of the environment. In this work, we investigate how robust multimodal representations can naturally develop in a self-organizing manner from co-occurring multisensory inputs. We propose a hierarchical architecture with growing self-organizing neural networks for learning human actions from audiovisual inputs. The hierarchical processing of visual inputs yields progressively specialized neurons that encode latent spatiotemporal dynamics of the input, consistent with neurophysiological evidence for increasingly large temporal receptive windows in the human cortex. Associative links that bind unimodal representations are incrementally learned by a semi-supervised algorithm with bidirectional connectivity. Multimodal representations of actions are obtained using the co-activation of action features from video sequences and labels from automatic speech recognition. Experimental results on a dataset of 10 full-body actions show that our system achieves state-of-the-art classification performance without requiring the manual segmentation of training samples, and that congruent visual representations can be retrieved from recognized speech in the absence of visual stimuli. Together, these results show that our hierarchical neural architecture accounts for the development of robust multimodal representations from dynamic audiovisual inputs.
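To make the abstract's "growing self-organizing neural networks" and semi-supervised label binding more concrete, the sketch below shows a generic Growing-When-Required (GWR)-style learning step with a simple label-association mechanism. It is an illustrative simplification, not the authors' implementation: the class and parameter names (GWRSketch, act_thresh, hab_thresh, eps_b, eps_n, tau_b, tau_n) are hypothetical, and the habituation and adaptation rules are reduced to their essentials.

```python
import numpy as np

class GWRSketch:
    """Minimal GWR-style growing network with label counts per node (illustrative only)."""

    def __init__(self, dim, act_thresh=0.85, hab_thresh=0.1,
                 eps_b=0.2, eps_n=0.05, tau_b=0.3, tau_n=0.1):
        # Standard GWR initialization: two random nodes, fully "fresh" (habituation = 1).
        self.w = [np.random.rand(dim), np.random.rand(dim)]  # node weight vectors
        self.hab = [1.0, 1.0]                                # habituation counters
        self.labels = [{}, {}]                               # per-node label frequency counts
        self.edges = set()                                   # topological links between nodes
        self.act_thresh, self.hab_thresh = act_thresh, hab_thresh
        self.eps_b, self.eps_n = eps_b, eps_n
        self.tau_b, self.tau_n = tau_b, tau_n

    def _two_best(self, x):
        # Best- and second-best-matching units by Euclidean distance.
        d = [np.linalg.norm(x - w) for w in self.w]
        order = np.argsort(d)
        return int(order[0]), int(order[1]), float(d[order[0]])

    def train_step(self, x, label=None):
        b, s, dist_b = self._two_best(x)
        self.edges.add(frozenset((b, s)))        # connect the two winners
        activity = np.exp(-dist_b)               # activation of the best-matching unit

        if activity < self.act_thresh and self.hab[b] < self.hab_thresh:
            # The input is poorly matched by an already well-trained node:
            # grow a new node halfway between the input and the best unit.
            r = len(self.w)
            self.w.append((self.w[b] + x) / 2.0)
            self.hab.append(1.0)
            self.labels.append({})
            self.edges.discard(frozenset((b, s)))
            self.edges.update({frozenset((b, r)), frozenset((r, s))})
            b = r
        else:
            # Otherwise adapt the best unit and its topological neighbours toward the input.
            self.w[b] += self.eps_b * self.hab[b] * (x - self.w[b])
            for e in list(self.edges):
                if b in e:
                    n = next(iter(e - {b}))
                    self.w[n] += self.eps_n * self.hab[n] * (x - self.w[n])
                    self.hab[n] += self.tau_n * (1.05 * (1 - self.hab[n]) - 1)
            # Simplified habituation decay of the winning node.
            self.hab[b] += self.tau_b * (1.05 * (1 - self.hab[b]) - 1)

        # Semi-supervised label association: co-occurring labels (e.g. from speech
        # recognition) are accumulated on the winning node when available.
        if label is not None:
            self.labels[b][label] = self.labels[b].get(label, 0) + 1

    def predict(self, x):
        # Return the most frequent label associated with the best-matching node, if any.
        b, _, _ = self._two_best(x)
        return max(self.labels[b], key=self.labels[b].get) if self.labels[b] else None
```

In a hierarchical setting of the kind described above, several such networks would be stacked so that each layer is trained on trajectories of winner activations from the layer below, and label associations would be held in a bidirectional associative layer; the single-network sketch only illustrates the growth, adaptation, and label-counting steps.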
@Article{PTWW16,
  author    = {Parisi, German I. and Tani, Jun and Weber, Cornelius and Wermter, Stefan},
  title     = {Emergence of Multimodal Action Representations from Neural Network Self-Organization},
  journal   = {Cognitive Systems Research},
  volume    = {43},
  year      = {2016},
  month     = {Aug},
  publisher = {Elsevier},
  doi       = {10.1016/j.cogsys.2016.08.002},
}