PAuSE: Pathological Speech Enhancement

Digital communication plays a vital role in our daily lives, occurring in many different forms such as mobile telephony, meetings via teleconferencing platforms like Zoom, and interaction with voice-activated agents such as Apple’s Siri or Amazon’s Alexa. Since we live in a noisy world, the microphone signals recorded in these applications are often contaminated by background noise. This noise degrades the signal, impairing speech quality and intelligibility and decreasing the performance of many signal processing techniques required for a high-fidelity communication experience. To deal with background noise, speech enhancement approaches aiming to recover the clean speech signal are indispensable. A wide range of speech enhancement approaches has therefore been proposed over the past decades, with traditional statistical model-based approaches and more recent deep learning-based approaches showing strong enhancement performance. Statistical model-based approaches generally rely on a Bayesian framework and assume that the clean speech spectral coefficients follow a Gaussian or super-Gaussian distribution. Deep learning-based approaches generally rely on a large amount of training data to learn a mapping from noisy to clean signals.

Although these approaches have shown advantageous enhancement performance, they have been devised for scenarios where the target speakers are neurotypical, i.e., speakers who do not exhibit any speech impairments. However, many pathological conditions, such as hearing loss, head and neck cancer, or neurological disorders, disrupt the speech production mechanism, resulting in speech impairments across different dimensions. As a result, the statistical distribution of pathological speech differs from that of neurotypical speech, and preliminary investigations show that state-of-the-art enhancement approaches can yield a considerably lower performance for pathological signals than for neurotypical signals. Although conditions resulting in pathological speech are widely prevalent, speech enhancement approaches specifically targeting pathological speech, e.g., through appropriate statistical distributions or training data, have never been established.

The PAuSE project aims at developing model-based and deep learning-based speech enhancement approaches that yield an advantageous performance for pathological speech. For model-based approaches, we will derive Minimum Mean Square Error (MMSE) and Maximum A Posteriori (MAP) estimators exploiting appropriate statistical distributions for pathological speech signals. For deep learning-based approaches, we will target two research directions to deal with the lack of extensive pathological speech training data. First, we will develop approaches which rely only on neurotypical training signals but exploit knowledge of the pathology-specific acoustic features that impact enhancement performance. Such approaches will be based on feature-aware networks, where the feature information is either directly embedded in the network or different networks are specialized to enhance signals with different feature profiles (an illustrative conditioning sketch is given at the end of this summary). Second, we will develop approaches that also exploit the typically scarce pathological training data. Such approaches will be based on pathology-aware networks, augmentation strategies specifically targeting pathological speech, and transfer learning and adaptation strategies.
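As an illustration of the model-based direction: under the common assumption that the clean speech and noise short-time Fourier transform coefficients are independent zero-mean complex Gaussian random variables, the MMSE estimator of the clean coefficient S_{k,ℓ} from the noisy coefficient Y_{k,ℓ} = S_{k,ℓ} + N_{k,ℓ} reduces to the classical Wiener gain. This is a baseline sketch of the standard Gaussian case, not one of the estimators to be derived in PAuSE:

\hat{S}_{k,\ell} = \mathrm{E}\{ S_{k,\ell} \mid Y_{k,\ell} \} = \frac{\xi_{k,\ell}}{1 + \xi_{k,\ell}} \, Y_{k,\ell}, \qquad \xi_{k,\ell} = \frac{\sigma_S^2(k,\ell)}{\sigma_N^2(k,\ell)},

where \xi_{k,\ell} denotes the a priori signal-to-noise ratio in frequency bin k and time frame \ell. Super-Gaussian or pathology-specific speech priors lead to different gain functions, which is precisely the design freedom the project will exploit.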
The conducted research will provide a better understanding of why enhancement performance deteriorates for pathological signals. Combining this knowledge with novel estimators and training frameworks will yield enhancement approaches with strong performance for the growing group of speakers with pathological speech.
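As an illustration of the feature-aware direction: the following minimal PyTorch sketch shows one possible way to embed a pathology-specific feature profile into a mask-estimation network, by scaling and shifting its hidden states with values predicted from the feature vector. All layer choices, dimensions, and the conditioning scheme are assumptions made for illustration and do not represent the project's architecture.

# Illustrative sketch (not the PAuSE architecture): a mask-estimation network
# whose hidden activations are modulated by a pathology feature vector
# (e.g., articulation precision or voice quality scores) via feature-wise
# scale-and-shift conditioning. All dimensions and layer choices are assumptions.
import torch
import torch.nn as nn


class FeatureAwareMaskEstimator(nn.Module):
    def __init__(self, n_freq_bins=257, n_path_feats=8, hidden=256):
        super().__init__()
        # Encode the noisy magnitude spectrogram frame by frame.
        self.encoder = nn.GRU(n_freq_bins, hidden, num_layers=2, batch_first=True)
        # Map the pathology feature profile to per-channel scale and shift.
        self.film = nn.Linear(n_path_feats, 2 * hidden)
        # Predict a bounded time-frequency mask.
        self.decoder = nn.Sequential(nn.Linear(hidden, n_freq_bins), nn.Sigmoid())

    def forward(self, noisy_mag, path_feats):
        # noisy_mag: (batch, frames, freq_bins); path_feats: (batch, n_path_feats)
        h, _ = self.encoder(noisy_mag)
        gamma, beta = self.film(path_feats).chunk(2, dim=-1)
        # Broadcast the conditioning over time and modulate the hidden states.
        h = gamma.unsqueeze(1) * h + beta.unsqueeze(1)
        return self.decoder(h)  # mask in [0, 1], applied to the noisy magnitude


if __name__ == "__main__":
    model = FeatureAwareMaskEstimator()
    noisy_mag = torch.rand(4, 100, 257)   # dummy batch of magnitude spectrograms
    path_feats = torch.rand(4, 8)         # dummy pathology feature profiles
    mask = model(noisy_mag, path_feats)
    enhanced_mag = mask * noisy_mag       # combined with the noisy phase for resynthesis
    print(enhanced_mag.shape)             # torch.Size([4, 100, 257])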
Idiap Research Institute
SNSF
Apr 01, 2024
Mar 31, 2028