Variational Autoencoder with Global- and Medium Timescale Auxiliaries for Emotion Recognition from Speech

Hussam Almotlak, Cornelius Weber, Leyuan Qu, Stefan Wermter

International Conference on Artificial Neural Networks (ICANN), Editors: Igor Farkaš, Paolo Masulli, Stefan Wermter, Volume LNCS 12396, pages 529-540, doi: 10.1007/978-3-030-61609-0_42 - Oct 2020

Unsupervised learning is based on the idea of self-organization to find hidden patterns and features in the data without the need for labels. Variational autoencoders (VAEs) are generative unsupervised learning models that create low-dimensional representations of the input data and learn by regenerating the same input from that representation. Recently, VAEs were used to extract representations from audio data, which carry not only content-dependent information but also speaker-dependent information such as gender, health status, and speaker ID. VAEs with two timescale variables were then introduced to disentangle these two kinds of information from each other. Our approach introduces a third, medium timescale into a VAE: instead of having only a global and a local timescale variable, the model holds a global, a medium, and a local variable. We tested the model on three downstream applications: speaker identification, gender classification, and emotion recognition, where each hidden representation performed better on some tasks than the other hidden representations. Speaker ID and gender were best reported by the global variable, while emotion was best extracted using the medium variable. Our model achieves excellent results, exceeding state-of-the-art models on speaker identification and emotion regression from audio.

@InProceedings{AWQW20, 
 	 author =  {Almotlak, Hussam and Weber, Cornelius and Qu, Leyuan and Wermter, Stefan},  
 	 title = {Variational Autoencoder with Global- and Medium Timescale Auxiliaries for Emotion Recognition from Speech}, 
 	 booktitle = {International Conference on Artificial Neural Networks (ICANN)},
 	 editor = {Igor Farkaš and Paolo Masulli and Stefan Wermter},
 	 volume = {LNCS 12396},
 	 pages = {529--540},
 	 year = {2020},
 	 month = {Oct},
 	 publisher = {Springer},
 	 doi = {10.1007/978-3-030-61609-0_42}, 
 }