Neural Architectures for Speech Technology

Recent years have seen many component technologies in speech processing replaced by deep neural networks (DNNs). Such systems typically perform much better than the older component technologies. The same period has (not independently) seen these technologies move from academia to industry. Areas such as speech recognition and synthesis that were once regarded as research fields are now ubiquitous applications, typically in the context of the "GAFA" (Google, Apple, Facebook, Amazon) companies. These companies have led advances based on large data and computational resources, and more recently on end-to-end approaches.

Responding to this trend, much academic research in speech technology has moved into what might be called peripheral technologies. In the case of the applicant here in Switzerland, that has meant focusing on geographical issues, in particular multilinguality and the closely related issues of paralinguistics and adaptation. Of course, many of the solutions to these issues lie in deep learning; however, the data resources can be rather small, which puts academia in a position to compete with the GAFA companies. In research threads on multilingual recognition and emotional synthesis, we are finding that, in order to do deep learning with such limited resources, it is helpful to cast techniques from signal processing, from bio-inspired computing and from Bayesian statistics into neural components. Integrated into neural networks, such components provide "explainability" that is not present in abstract sigmoid units; it is then clear how they might be adapted to variations in speaker and language.

In NAST, the objective is to consolidate the two (application-directed) research themes above into a single theme around neural architectures. Specifically, the EU H2020 project SUMMA, although geared towards multilingual speech recognition, is yielding results in Bayesian methods and recurrence. The SNSF project MASS, focusing on emotional synthesis, has cast muscle models as neural components. We intend to blur the distinction between recognition and synthesis since the proposed techniques are applicable to both; this also reflects theories of physiological processes. We aim to create what might be called a toolkit of neural techniques. This toolkit already contains rudimentary muscle models, initial Bayesian recurrence and vocal tract warping. A key feature of all the neural techniques is that they will be trainable in an end-to-end manner, allowing them to be fully optimised in the context of the application at hand, be it recognition or synthesis of speech or emotion.

In a first thread, we propose to extend the muscle models developed for intonation synthesis by driving them with spiking neurons. Whilst quite ambitious, this thread builds on initial work by a master's student, and is written with multiple chances to back off to more conventional techniques. Indeed, the most likely and influential outcome will be a hybrid of spiking and conventional neurons in a coherent framework. In a second, more incremental thread, we propose to consolidate the work of two doctoral students. In finishing their doctoral studies, they will provide neural components for the toolkit that will feed into their own work, into that of the first thread above, and into a new task on factoring waveform synthesis. Each thread is written to allow interactions both within the thread and between the two, with many components being reused across tasks. The sketches below illustrate, in simplified form, the style of component we have in mind.
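First, a minimal sketch of how a muscle model might be cast as a neural component, assuming (as in Fujisaki-style intonation models) a critically damped second-order response; the class name MuscleUnit and the learnable stiffness parameter alpha are illustrative, not taken from the MASS code:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MuscleUnit(nn.Module):
        """Critically damped second-order muscle dynamics as a recurrent layer."""
        def __init__(self, alpha_init=3.0, dt=0.01):
            super().__init__()
            # Unconstrained parameter; softplus below keeps the stiffness positive.
            self.raw_alpha = nn.Parameter(torch.tensor(alpha_init))
            self.dt = dt

        def forward(self, u):
            # u: (batch, time) activation commands driving the muscle.
            alpha = F.softplus(self.raw_alpha)
            x1 = torch.zeros(u.shape[0], device=u.device)  # position
            x2 = torch.zeros_like(x1)                      # velocity
            ys = []
            for k in range(u.shape[1]):
                # Forward-Euler step of x'' + 2*alpha*x' + alpha^2*x = alpha^2*u.
                x1 = x1 + self.dt * x2
                x2 = x2 + self.dt * (alpha**2 * (u[:, k] - x1) - 2 * alpha * x2)
                ys.append(x1)
            return torch.stack(ys, dim=1)

Because each step is written in differentiable tensor operations, the physiologically meaningful parameter alpha can be fitted jointly with the rest of a network; this is the sense in which such components are at once end-to-end trainable and explainable.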
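Similarly, a sketch of vocal tract warping as a differentiable layer, assuming the common bilinear (all-pass) frequency warp used in vocal tract length normalisation; the learnable warp factor is again an illustrative assumption:

    import math
    import torch
    import torch.nn as nn

    class BilinearWarp(nn.Module):
        """Warp a spectrum along frequency with a learnable factor."""
        def __init__(self):
            super().__init__()
            self.raw_alpha = nn.Parameter(torch.tensor(0.0))  # 0 gives the identity warp

        def forward(self, spec):
            # spec: (batch, n_bins), bins uniformly spaced over [0, pi].
            n = spec.shape[1]
            omega = torch.linspace(0.0, math.pi, n, device=spec.device)
            a = torch.tanh(self.raw_alpha)  # keep |alpha| < 1 so the warp stays monotonic
            # Phase response of the all-pass transform: the bilinear warp itself.
            warped = omega + 2 * torch.atan2(a * torch.sin(omega),
                                             1 - a * torch.cos(omega))
            # Differentiable linear interpolation at the warped frequencies.
            idx = warped / math.pi * (n - 1)
            lo = idx.floor().long().clamp(0, n - 2)
            frac = (idx - lo).clamp(0, 1)
            return spec[:, lo] * (1 - frac) + spec[:, lo + 1] * frac

A recogniser containing such a layer can adapt to a new speaker by adjusting a single interpretable parameter rather than retraining large weight matrices; this is the sense in which such components support adaptation on small amounts of data.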
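Finally, the first thread presumes that spiking units can live inside a conventionally trained network. One standard way of achieving this is a leaky integrate-and-fire (LIF) neuron with a surrogate gradient; the constants and the surrogate choice in this sketch are assumptions rather than the project's design:

    import torch

    class SpikeFn(torch.autograd.Function):
        """Heaviside spike forwards, fast-sigmoid surrogate gradient backwards."""
        @staticmethod
        def forward(ctx, v):
            ctx.save_for_backward(v)
            return (v > 0).float()

        @staticmethod
        def backward(ctx, grad_out):
            (v,) = ctx.saved_tensors
            return grad_out / (1 + 10 * v.abs()) ** 2

    def lif(current, beta=0.9, threshold=1.0):
        # current: (batch, time) input current; returns a (batch, time) spike train.
        v = torch.zeros(current.shape[0], device=current.device)
        spikes = []
        for t in range(current.shape[1]):
            v = beta * v + current[:, t]      # leaky membrane integration
            s = SpikeFn.apply(v - threshold)  # fire when the membrane crosses threshold
            v = v - s * threshold             # soft reset after a spike
            spikes.append(s)
        return torch.stack(spikes, dim=1)

The spike train produced by lif could then serve as the command u of a unit like MuscleUnit above; backing off to more conventional techniques, as the proposal allows, amounts to replacing SpikeFn with an ordinary sigmoid, which is why a hybrid of spiking and conventional neurons is natural in this framework.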
In a technology sense, we may hope that the work will allow adaptation to speakers, to emotions and to languages with better quality and on smaller amounts of data. For particularly under-resourced or localised dialects, we hope to enable capabilities that would not otherwise exist. The resulting networks will have fewer parameters, making them smaller and faster. More generally, the tools will support "explainability" in DNNs; rather than seeking meaning in networks of otherwise abstract activations, we provide activations that are fundamentally based on explainable processes. The toolkit, whilst distributed across several open-source packages, will enable transfer of the technology to the academic community, to industrial collaborators here in Switzerland, and hopefully to the GAFA companies. The students will be complemented by several post-doctoral researchers in the Idiap speech group working on Innosuisse, EU and industrial projects; these researchers will aid both the research and its potential impact. In a more philosophical sense, we hope to build a bridge between the engineering of the GAFA companies and the speech "sciences" to which academic speech technology has often looked for inspiration.
Idiap Research Institute
SNSF
Feb 01, 2020
Sep 30, 2024