Disentangling Prosody Representations with Unsupervised Speech Reconstruction
   
  
IEEE/ACM Transactions on Audio, Speech, and Language Processing, Volume 32, pages 39-54, doi: 10.1109/TASLP.2023.3320864 - Oct 2023

Human speech can be characterized by different components, including semantic content, speaker identity and prosodic information. Significant progress has been made in disentangling representations for semantic content and speaker identity in speech recognition and speaker verification tasks, respectively. However, extracting prosodic information remains an open and challenging question, both because different attributes, such as timbre and rhythm, are intrinsically associated and because robust speech recognition relies on supervised training schemes. The aim of this article is to address the disentanglement of emotional prosody based on unsupervised reconstruction. Specifically, we identify, design, implement and integrate three crucial components in our proposed model Prosody2Vec: (1) a unit encoder that transforms speech signals into discrete units for semantic content, (2) a pretrained speaker verification model to generate speaker identity embeddings, and (3) a trainable prosody encoder to learn prosody representations. We first pretrain Prosody2Vec on unlabelled emotional speech corpora, then fine-tune the model on specific datasets to perform Speech Emotion Recognition (SER) and Emotional Voice Conversion (EVC) tasks. Both objective and subjective evaluations on the EVC task suggest that Prosody2Vec effectively captures general prosodic features that can be smoothly transferred to other emotional speech. In addition, our SER experiments on the IEMOCAP dataset reveal that the prosody features learned by Prosody2Vec are complementary and beneficial to the performance of widely used speech pretraining models, and that combining Prosody2Vec with HuBERT representations surpasses state-of-the-art methods. Audio samples can be found on our demo website.
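
The three components listed in the abstract can be pictured as three separate input streams that are joined before reconstruction. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the function name `build_decoder_inputs`, the embedding sizes, the random placeholder vectors, and the choice of frame-wise concatenation are all hypothetical and are not taken from the article.

```python
import random

def build_decoder_inputs(unit_ids, unit_table, speaker_emb, prosody_emb):
    """Join per-frame content embeddings with utterance-level speaker and
    prosody vectors (assumed fusion scheme: simple concatenation)."""
    frames = []
    for uid in unit_ids:
        # Look up the embedding of the discrete content unit for this frame,
        # then append the same speaker and prosody vectors to every frame.
        frames.append(list(unit_table[uid]) + list(speaker_emb) + list(prosody_emb))
    return frames

rng = random.Random(0)

# Illustrative sizes only; the article's actual dimensions are not given here.
NUM_UNITS, UNIT_DIM, SPK_DIM, PROS_DIM = 100, 8, 4, 3

# (1) Stand-in for the unit encoder: discrete unit IDs plus an embedding table.
unit_table = [[rng.gauss(0, 1) for _ in range(UNIT_DIM)] for _ in range(NUM_UNITS)]
unit_ids = [5, 5, 17, 42]

# (2) Stand-in for the output of a pretrained speaker verification model.
speaker_emb = [rng.gauss(0, 1) for _ in range(SPK_DIM)]

# (3) Stand-in for the output of the trainable prosody encoder.
prosody_emb = [rng.gauss(0, 1) for _ in range(PROS_DIM)]

decoder_inputs = build_decoder_inputs(unit_ids, unit_table, speaker_emb, prosody_emb)
print(len(decoder_inputs), len(decoder_inputs[0]))
```

In a setup like this, only the prosody stream would be trained during reconstruction, so whatever the content and speaker streams do not already explain is left for the prosody vector to capture.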

@Article{QLWPRW23,
  author = {Qu, Leyuan and Li, Taihao and Weber, Cornelius and Pekarek-Rosin, Theresa and Ren, Fuji and Wermter, Stefan},
  title = {Disentangling Prosody Representations with Unsupervised Speech Reconstruction},
  journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  volume = {32},
  pages = {39--54},
  year = {2023},
  month = {Oct},
  publisher = {IEEE},
  doi = {10.1109/TASLP.2023.3320864},
}

