Hearing Faces: Target Speaker Text-To-Speech Synthesis from a Face

IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) - Dec 2021
Associated documents :  
The existence of a learnable cross-modal association between a person’s face and their voice is recently becoming more and more evident. This provides the basis for the task of target speaker text-to-speech (TTS) synthesis from face reference. In this paper, we approach this task by proposing a cross-modal model architecture combining existing unimodal models. We use Tacotron 2 multi-speaker TTS with auditory speaker embeddings based on Global Style Tokens. We transfer learn a FaceNet face encoder to predict these embeddings from a static face image reference instead of a voice reference and thus predict a speaker’s voice and speaking characteristics from their face. Compared to Face2Speech, the only existing work on this task, we use a more modular architecture that allows the use of openly available and pretrained model components. This approach enables high-quality speech synthesis and allows for an easily extensible model architecture. Experimental results show good matching ability while retaining better voice naturalness than Face2Speech. We examine the limitations of our model and discuss multiple possible avenues of improvement for future work. Index Terms: multi-speaker text-to-speech synthesis, speaker embedding, cross-modal learning, transfer learning


 	 author =  {Plüster, Björn and Weber, Cornelius and Qu, Leyuan and Wermter, Stefan},  
 	 title = {Hearing Faces: Target Speaker Text-To-Speech Synthesis from a Face}, 
 	 booktitle = {IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
 	 number = {},
 	 volume = {},
 	 pages = {},
 	 year = {2021},
 	 month = {Dec},
 	 publisher = {IEEE},
 	 doi = {},