Disentanglement of Prosody Representations via Diffusion Models and Scheduled Gradient Reversal
IEEE Transactions on Neural Networks and Learning Systems,
pages 1 - 12,
doi: 10.1109/TNNLS.2025.3534822
Feb 2025
Prosody plays a fundamental role in human speech and communication, facilitating intelligibility and conveying emotional and cognitive states. Extracting accurate prosodic information from speech is vital for building assistive technology, such as controllable speech synthesis, speaking style transfer, and speech emotion recognition (SER). However, it is challenging to disentangle speaker-independent prosody representations, since prosodic attributes, such as intonation, are heavily entangled with speaker-specific attributes, e.g., pitch. In this article, we propose a novel model, called Diffsody, to disentangle and refine prosody representations: 1) to disentangle prosody representations, we leverage the expressive generative ability of a diffusion model by conditioning it on quantized semantic information and pretrained speaker embeddings, while a prosody encoder learns, in an unsupervised fashion, the prosody representations used for spectrogram reconstruction; and 2) to refine these representations and make them speaker-invariant, a scheduled gradient reversal layer (sGRL) is proposed and integrated into the prosody encoder of Diffsody. We evaluate Diffsody extensively, both qualitatively and quantitatively. t-SNE visualization and speaker verification experiments demonstrate the efficacy of the sGRL method in preventing speaker-specific information leakage. Experimental results on speaker-independent SER and automatic depression detection (ADD) tasks demonstrate that Diffsody can efficiently factorize speaker-independent prosody representations, yielding significant gains on both tasks. In addition, Diffsody integrates synergistically with the semantic representation model WavLM, which further improves performance and outperforms contemporary methods on both SER and ADD. Furthermore, the Diffsody model shows promising potential for various practical applications, such as voice or style conversion. Audio samples are available on our demo website: https://leyuanqu.github.io/Diffsody/demo
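The scheduled gradient reversal layer is the component of Diffsody that is easiest to illustrate in code. The sketch below is a minimal PyTorch illustration of a gradient reversal layer whose reversal coefficient follows a warm-up schedule over training; the class names, the DANN-style sigmoid ramp, and the step counting are illustrative assumptions, not the paper's exact formulation.

```python
import math
import torch


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates and scales gradients in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse the gradient flowing back into the prosody encoder,
        # scaled by the schedule coefficient lambda.
        return -ctx.lambd * grad_output, None


def grl_schedule(step, total_steps, gamma=10.0):
    """Hypothetical warm-up: lambda ramps smoothly from 0 to 1 over training
    (sigmoid schedule as used in DANN-style adversarial training)."""
    p = min(step / total_steps, 1.0)
    return 2.0 / (1.0 + math.exp(-gamma * p)) - 1.0


class ScheduledGRL(torch.nn.Module):
    """Gradient reversal with a training-progress-dependent coefficient (assumed sGRL sketch)."""

    def __init__(self, total_steps, gamma=10.0):
        super().__init__()
        self.total_steps = total_steps
        self.gamma = gamma
        self.step = 0

    def forward(self, x):
        lambd = grl_schedule(self.step, self.total_steps, self.gamma)
        if self.training:
            self.step += 1
        return GradReverse.apply(x, lambd)
```

In a setup like the one described in the abstract, such a layer would presumably sit between the prosody encoder and an adversarial speaker classifier, so that the reversed gradients discourage speaker-specific information from leaking into the prosody representation while the schedule keeps the adversarial signal weak early in training.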

@Article{QWWJGLW25,
  author  = {Qu, Leyuan and Weber, Cornelius and Wang, Wei and Jin, Jia and Gao, Yingming and Li, Taihao and Wermter, Stefan},
  title   = {Disentanglement of Prosody Representations via Diffusion Models and Scheduled Gradient Reversal},
  journal = {IEEE Transactions on Neural Networks and Learning Systems},
  pages   = {1--12},
  year    = {2025},
  month   = {Feb},
  doi     = {10.1109/TNNLS.2025.3534822},
}