Controllability and Interpretability in Affective Speech Synthesis
In his thesis, Schnell aims to make affective speech synthesis more controllable and interpretable by incorporating prior knowledge about speech and its physiological production into the text-to-speech (TTS) framework. To this end, he brings well-established signal processing techniques into neural networks. Starting from emphasised speech, he investigates intonation production with a physiologically plausible intonation model previously developed at Idiap. To generalise the model to longer prosodic sequences, he emulates a Spiking Neural Network (SNN) with a Recurrent Neural Network that has trainable second-order recurrent elements and is trained with a learning function inspired by SNNs. The model synthesises neutral intonation with high naturalness while retaining the physiological plausibility and controllability of the intonation model. After intonation, he turns to spectral features, specifically formant frequencies, which have been shown to be indicators of certain emotions.
More Information
• Controllability and Interpretability in Affective Speech Synthesis
• Our Speech & Audio Processing group