On June 20, 2013, David Imseng successfully defended his PhD thesis entitled "Multilingual speech recognition - a posterior based approach".

Modern automatic speech recognition (ASR) systems are based on parametric statistical models such as hidden Markov models (HMMs), exploiting 1) acoustic-phonetic models, which need to be trained on large amounts of acoustic data, 2) a language model, which needs to be trained on large amounts of text data and, finally, 3) a lexicon with phonetic transcriptions, which requires linguistic expertise. Developing multilingual ASR systems, or systems that are robust to accents and dialects, is therefore very challenging with current state-of-the-art approaches.

In this thesis, we focus on investigating acoustic-phonetic modeling and lexical diversity across languages and databases, and assume that a language model is available. In our case, this is done in the context of hybrid HMM/MLP ASR, where the HMM emission probabilities are modeled as posterior probabilities of HMM states, conditioned on the acoustics, estimated at the output of a multilayer perceptron (MLP). We build upon a recently proposed acoustic modeling approach, referred to as KL-HMM, where posterior probabilities are directly used as acoustic features, and where the HMM states are directly parametrized by trained posterior probabilities. The set of HMM reference posteriors is then estimated by minimizing the Kullback–Leibler divergence between posterior features extracted from the training data and reference posteriors.
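
As a rough illustration of the KL-HMM idea described above, the following minimal Python sketch scores a posterior feature vector against per-state reference posteriors using the Kullback–Leibler divergence. It is not the thesis implementation. The abstract does not say which direction of the divergence is used, so the sketch assumes KL(feature || reference), for which the reference that minimizes the summed divergence over a state's training frames is simply their arithmetic mean. All names and numbers (`kl_divergence`, `estimate_reference`, the three-class posterior vectors) are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists.

    eps is a small constant (an assumption of this sketch) that guards
    against log(0) when a probability is exactly zero.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def estimate_reference(frames):
    """Reference posterior for one state under the assumed KL direction.

    Minimizing sum_t KL(z_t || y) over distributions y yields the
    arithmetic mean of the posterior feature vectors z_t.
    """
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[k] for frame in frames) / n for k in range(dim)]


# Hypothetical 3-class posterior features (e.g. MLP outputs) assigned to two states.
frames_state_a = [[0.7, 0.2, 0.1], [0.8, 0.1, 0.1], [0.6, 0.3, 0.1]]
frames_state_b = [[0.1, 0.2, 0.7], [0.2, 0.2, 0.6]]

ref_a = estimate_reference(frames_state_a)
ref_b = estimate_reference(frames_state_b)

# Local score of an unseen frame against each state: lower divergence = better match.
test_frame = [0.75, 0.15, 0.10]
cost_a = kl_divergence(test_frame, ref_a)
cost_b = kl_divergence(test_frame, ref_b)
```

In a full system these per-frame divergences would serve as local scores inside HMM decoding (for example Viterbi), and the frame-to-state assignment would itself be re-estimated during training; both steps are omitted here.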

The proposed KL-HMM model is extensively developed and adapted to tackle several challenging problems related to multilingual ASR, including lexical diversity, stochastic phone space transformations, accented speech recognition and the use of multilingual data resources to boost monolingual systems. The efficiency of the proposed approach is demonstrated through theoretical and experimental comparisons with similar approaches such as probabilistic acoustic mapping, linear hidden networks and maximum a posteriori adaptation. Furthermore, KL-HMM is also compared with other posterior-feature-based ASR techniques such as Tandem, and with short-term spectral-feature-based approaches such as subspace Gaussian mixture models. The comparison reveals that the KL-HMM framework is a suitable alternative to conventional acoustic modeling techniques and seems to be preferable when little training data is available as well as in phoneme set mismatch scenarios.

Keywords: Multilingual speech recognition, multilingual acoustic modeling, posterior features, KL-HMM, non-native speech recognition, under-resourced languages

Get the full paper here:
Multilingual speech recognition A posterior based approach