Controllability and Interpretability in Affective Speech Synthesis

In his research, Idiap student Bastian Schnell argues that affective TTS can be enabled by models that generalise better to the variability of speech thanks to components that are interpretable by humans.

In his thesis, Schnell aims to do so by incorporating prior knowledge about speech and its physiological production into the TTS framework, introducing well-established signal processing techniques into neural networks. Starting from emphasised speech, he investigates intonation production with a physiologically plausible intonation model previously developed at Idiap. To generalise the model to longer prosodic sequences, he emulates a Spiking Neural Network (SNN) with a Recurrent Neural Network built from trainable second-order recurrent elements and trained with a learning function inspired by SNNs. The resulting model synthesises neutral intonation with high naturalness while retaining the physiological plausibility and controllability of the intonation model. Beyond intonation, he then turns to spectral features, in particular formant frequencies, which have been shown to be indicators of certain emotions.
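The article does not detail the architecture, but a trainable second-order recurrent element can be pictured as a unit whose output follows a two-pole (damped-oscillator) recurrence with learnable coefficients, producing smooth, decaying contours of the kind used in physiologically motivated intonation models. The following minimal PyTorch sketch illustrates that idea only; the class name, equations and parameter values are assumptions, not the model described in the thesis.

import torch
import torch.nn as nn

class SecondOrderRecurrentLayer(nn.Module):
    """Bank of units with a trainable second-order (two-pole) recurrence:
    y[t] = a1 * y[t-1] + a2 * y[t-2] + b * x[t], per unit.
    Illustrative sketch only, not the architecture from the thesis."""
    def __init__(self, num_units: int):
        super().__init__()
        # Per-unit coefficients; the initial values place a double pole at
        # z = 0.9, i.e. a stable, critically damped impulse response.
        self.a1 = nn.Parameter(torch.full((num_units,), 1.8))
        self.a2 = nn.Parameter(torch.full((num_units,), -0.81))
        self.b = nn.Parameter(torch.ones(num_units))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, num_units) spike-like or continuous unit inputs
        batch, steps, units = x.shape
        y_prev1 = x.new_zeros(batch, units)
        y_prev2 = x.new_zeros(batch, units)
        outputs = []
        for t in range(steps):
            y = self.a1 * y_prev1 + self.a2 * y_prev2 + self.b * x[:, t]
            outputs.append(y)
            y_prev2, y_prev1 = y_prev1, y
        return torch.stack(outputs, dim=1)

# Example: four units driven by a single impulse yield slowly decaying
# contours reminiscent of intonation components.
layer = SecondOrderRecurrentLayer(num_units=4)
impulse = torch.zeros(1, 100, 4)
impulse[0, 0] = 1.0
contour = layer(impulse)  # shape (1, 100, 4)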

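Formant frequencies are the resonances of the vocal tract, visible as peaks in the spectral envelope. As a purely illustrative example of how they are commonly estimated (the thesis's own feature pipeline is not described here), the textbook LPC root-finding method can be sketched as follows; the function name and parameter choices are assumptions.

import numpy as np
from scipy.linalg import solve_toeplitz

def estimate_formants(frame: np.ndarray, sample_rate: int, order: int = 12):
    """Estimate formant frequencies (Hz) of one voiced speech frame via
    LPC root-finding; a standard method, shown only to illustrate the
    kind of spectral feature referred to above."""
    # Pre-emphasis and windowing reduce spectral tilt and edge effects.
    frame = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])
    frame = frame * np.hamming(len(frame))
    # Autocorrelation method: solve the Toeplitz normal equations for the
    # LPC coefficients a[1..order].
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])
    # Roots of the prediction polynomial A(z) = 1 - sum a_k z^-k give the
    # resonances; their angles map to frequencies.
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]
    freqs = np.sort(np.angle(roots) * sample_rate / (2 * np.pi))
    return freqs[freqs > 90]  # discard near-DC roots

# Hypothetical usage: first three formants of a 25 ms frame at 16 kHz.
# frame = waveform[start:start + 400]
# f1, f2, f3 = estimate_formants(frame, 16000)[:3]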
More Information
Controllability and Interpretability in Affective Speech Synthesis
Our Speech & Audio Processing group