Hearing Faces: Target Speaker Text-To-Speech Synthesis from a Face
IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Dec 2021
The existence of a learnable cross-modal association between a person's face and their voice has recently become increasingly evident. This provides the basis for the task of target speaker text-to-speech (TTS) synthesis from a face reference. In this paper, we approach this task by proposing a cross-modal model architecture that combines existing unimodal models. We use Tacotron 2 multi-speaker TTS with auditory speaker embeddings based on Global Style Tokens. We transfer-learn a FaceNet face encoder to predict these embeddings from a static face image reference instead of a voice reference, and thus predict a speaker's voice and speaking characteristics from their face. Compared to Face2Speech, the only existing work on this task, we use a more modular architecture that allows the use of openly available, pretrained model components. This approach enables high-quality speech synthesis and an easily extensible model architecture. Experimental results show good matching ability while retaining better voice naturalness than Face2Speech. We examine the limitations of our model and discuss multiple possible avenues of improvement for future work.
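As a rough illustration of the modular pipeline summarized above, the following PyTorch-style sketch shows how a face encoder could stand in for an auditory reference encoder at inference time: a face image is mapped to a speaker embedding in the Global Style Token space, which then conditions a multi-speaker Tacotron 2 model. All names (FaceToSpeakerEmbedding, synthesize_from_face, the tacotron2 and vocoder callables) and the 512-dimensional face features and 256-dimensional embedding are illustrative assumptions, not the authors' released code.

    # Hypothetical sketch of the cross-modal inference pipeline described in the
    # abstract. Module names and dimensions are assumptions for illustration only.
    import torch
    import torch.nn as nn

    class FaceToSpeakerEmbedding(nn.Module):
        """Maps a static face image to a speaker embedding in the GST space."""

        def __init__(self, face_backbone: nn.Module,
                     face_feature_dim: int = 512, embedding_dim: int = 256):
            super().__init__()
            self.backbone = face_backbone          # e.g. a pretrained FaceNet trunk
            self.projection = nn.Linear(face_feature_dim, embedding_dim)

        def forward(self, face_image: torch.Tensor) -> torch.Tensor:
            face_features = self.backbone(face_image)   # (batch, face_feature_dim)
            return self.projection(face_features)       # (batch, embedding_dim)

    def synthesize_from_face(face_image, text_ids, face_encoder, tacotron2, vocoder):
        """End-to-end inference: face image + text -> waveform (illustrative only)."""
        with torch.no_grad():
            # The predicted embedding replaces the GST embedding of a voice reference.
            speaker_embedding = face_encoder(face_image)
            mel_spectrogram = tacotron2(text_ids, speaker_embedding)
            waveform = vocoder(mel_spectrogram)
        return waveform

In a setup like this, the transfer-learning step mentioned in the abstract could amount to fitting the face encoder to regress the GST embeddings computed from each training speaker's reference audio (e.g. with an L2 loss), so that at test time only a face image is needed.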
Index Terms: multi-speaker text-to-speech synthesis, speaker
embedding, cross-modal learning, transfer learning
@InProceedings{PWQW21,
  author    = {Plüster, Björn and Weber, Cornelius and Qu, Leyuan and Wermter, Stefan},
  title     = {Hearing Faces: Target Speaker Text-To-Speech Synthesis from a Face},
  booktitle = {IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  year      = {2021},
  month     = {Dec},
  publisher = {IEEE},
}