Idiap Research Institute

(Archives 2011) RECOD - Low bit-rate speech coding

Very low bit rate speech coding demonstration page

Progress report of a project in very low bit-rate speech coding

Over the last two decades, automatic speech recognition (ASR) and text-to-speech (TTS) technologies have almost completely converged on a single paradigm: the hidden Markov model (HMM). The HMM framework is almost entirely data-driven; that is, it responds automatically to data with little human interaction required. In general, peripheral technologies such as speech coding share the HMMs' data-driven capabilities to their advantage; they allow, for example, tuning to a particular user from a few minutes of speech.

Speech coding at very low bit rates can be achieved by integrating ASR and TTS, so that a sequence of symbols, such as phonemes, is transmitted instead of a compressed audio signal. This demonstration page presents two such systems.

HNM based system:
The first system uses Harmonic plus Noise Model (HNM) synthesis: it performs speaker-dependent unit selection on the encoder side and overlap-and-add (OLA) post-processing on the decoder side. While unit selection also responds well to training data, as is the case for ASR, adaptation to new speakers (voices) is more difficult. The HNM coding system thus offers language-independent, but speaker-dependent, speech-to-speech transmission.

HMM based system:
The second speech coding system uses HMM speech synthesis (H-triple-S, or HTS). The HMM approach can perform as well as unit selection, but has all the advantages of statistical rigour and of peripheral technologies, as in ASR. For instance, distinct voices have been produced from small amounts of adaptation data. The HMM coding system thus offers a language-dependent, but speaker-independent, system; speaker independence is achieved by using model adaptation techniques.

Transmission rates:
The theoretical lower limit of the transmission (bit) rate is around 50 bits per second (bps). However, in order to achieve language and speaker independence in the coding systems, additional information needs to be transmitted, increasing the bit rate to around 200-300 bps.
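These figures can be sanity-checked with simple arithmetic. A sketch, under assumed figures: the 6-bit phoneme encoding matches the video demonstration below, while the phoneme rate and the amount of side information are illustrative guesses, not measured values from the systems:

```python
import math

# A 64-symbol phoneme inventory (as used in the demos below) fits in 6 bits;
# conversational speech averages roughly 10 phonemes per second (assumption).
phoneme_inventory = 64
bits_per_phoneme = math.ceil(math.log2(phoneme_inventory))  # 6 bits
phonemes_per_second = 10

baseline_bps = bits_per_phoneme * phonemes_per_second
print(baseline_bps)  # 60 bps, close to the ~50 bps theoretical limit

# Side information (timing, unit-example selection, prosody) raises the
# rate; a couple of hundred extra bits per second gives the 200-300 bps range.
side_info_bps = 200
print(baseline_bps + side_info_bps)
```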

Examples (audio samples in the original page):

Language         System          HNM bestdtw  HNM bestcat  HTS average  HTS VTLN adapt (1 utt)  HTS CSMAPLR adapt (25 utts)
English          Original voice
                 HMM coding      -            -            (50 bps)     (50 bps)                (200-300 bps)
                 HNM coding      (100 bps)    (100 bps)    -            -                       -
Valaisan German  Original voice
                 HNM coding      (100 bps)    (100 bps)    -            -                       -
French           Original voice
                 HNM coding      (100 bps)    (100 bps)    -            -                       -

Hidden Markov Model (HMM) speech coding

The HTS technique is a new TTS paradigm that has emerged from ASR technology; it can be thought of as an inversion of an HMM, allowing speech to be synthesized as well as recognized. Although the HMM and HTS paradigms unify the general theory of ASR and TTS, there is still a significant practical gap between the two approaches; nevertheless, they can be integrated into an elegant solution for very low bit-rate speech coding. Voice adaptation in HTS starts with HMMs trained on many speakers (HTS average) and uses HMM adaptation techniques drawn from speech recognition to adapt the models to a new speaker (of the same language and with the same accent). Two adaptation techniques are presented:

  1. Vocal Tract Length Normalisation (VTLN) adaptation, which requires negligible information to be transmitted, but whose adaptation performance is rather weak (results are observable when adapting between voices of different genders).
  2. Constrained Structural Maximum A Posteriori Linear Regression (CSMAPLR) adaptation, which performs much better, but whose estimated bit rates are much higher.
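The reason VTLN costs almost nothing in bit rate is that the whole adaptation reduces to a single scalar warping of the frequency axis per speaker. A minimal sketch of one common piecewise-linear formulation (the function and parameter names are assumptions, not the exact HTS implementation):

```python
def vtln_warp(freq_hz, alpha, f_max=8000.0, cut=0.875):
    """Piecewise-linear VTLN frequency warp (a common formulation).

    alpha > 1 compresses the spectrum (longer vocal tract), alpha < 1
    stretches it. Only this single scalar per speaker needs to be
    transmitted, which is why the VTLN bit-rate cost is negligible.
    """
    # Breakpoint, pulled inside the band for alpha > 1 so nothing warps
    # past the Nyquist edge.
    f0 = cut * f_max / max(alpha, 1.0)
    if freq_hz <= f0:
        return alpha * freq_hz
    # Second linear segment maps f_max onto itself, preserving the band edge.
    slope = (f_max - alpha * f0) / (f_max - f0)
    return alpha * f0 + slope * (freq_hz - f0)
```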

Architecture of the HMM coder

Video demonstration of the proposed speech coding approach:

The video demonstrates the baseline system, which operates at the lower limit of around 50 bps. Only the recognized phonemes (each encoded with 6 bits) are transmitted through the channel, a TCP/IP connection. No adaptation is used, so the voice of the synthesized speech is the same for all input speakers; the baseline is thus a speaker-dependent system. The upper terminal runs the HTS server (decoder) and the lower terminal runs the ASR client (encoder). The ASR module listens to the input microphone and, at the end of each utterance, the sequence of phonemes (letters) is transmitted to HTS. The HTS module converts the transmitted sequence of letters back to speech (please use the full-screen view):
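The transmission format described above is simple enough to sketch: with a 64-symbol inventory each phoneme fits in 6 bits, and the codes can be packed into bytes before being written to the TCP socket. A minimal illustration of such packing (the helper names and framing are assumptions, not the actual demo protocol):

```python
def pack_6bit(codes):
    """Pack a sequence of 6-bit phoneme codes into bytes (MSB first)."""
    bits, nbits, out = 0, 0, bytearray()
    for c in codes:
        assert 0 <= c < 64, "codes must fit in 6 bits"
        bits = (bits << 6) | c
        nbits += 6
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    if nbits:
        out.append((bits << (8 - nbits)) & 0xFF)  # zero-pad the final byte
    return bytes(out)  # these bytes would be sent over the TCP/IP channel

def unpack_6bit(data, n_codes):
    """Recover n_codes 6-bit phoneme codes from packed bytes."""
    bits, nbits, codes = 0, 0, []
    for byte in data:
        bits = (bits << 8) | byte
        nbits += 8
        while nbits >= 6 and len(codes) < n_codes:
            nbits -= 6
            codes.append((bits >> nbits) & 0x3F)
    return codes
```

At roughly ten phonemes per second, this channel carries about 60 bps of payload, consistent with the ~50 bps limit quoted above.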


Harmonic plus Noise Model (HNM) speech coding

HNM speech coding is based on Harmonic plus Noise modeling of speech chunks, also called units (or segments). Each unit can be represented by several particular examples (developed during training). In the audio demonstrations above, the models of 64 units were trained from an English broadcast speech database. On the encoder side, the input speech is transcribed by an ASR module into a sequence of units, together with additional information about the selection of the best example and about timing. The decoder takes the unit sequence (it shares the database of unit examples with the encoder) and concatenates the selected examples using the OLA technique to produce the output speech.
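The decoder's concatenation step can be sketched with a simple linear cross-fade overlap-and-add; this is an assumed minimal formulation, whereas the real system concatenates HNM-resynthesized unit examples with its own OLA post-processing:

```python
import numpy as np

def ola_concat(units, overlap=160):
    """Concatenate waveform units with a linear cross-fade (overlap-and-add).

    units: list of 1-D float arrays (unit examples from the shared database),
    each longer than `overlap`.
    overlap: cross-fade length in samples (160 samples = 20 ms at 8 kHz).
    """
    out = np.asarray(units[0], dtype=float)
    fade_in = np.linspace(0.0, 1.0, overlap)
    fade_out = 1.0 - fade_in  # the two ramps sum to 1 at every sample
    for unit in units[1:]:
        unit = np.asarray(unit, dtype=float)
        # Cross-fade the tail of the output with the head of the next unit.
        out[-overlap:] = out[-overlap:] * fade_out + unit[:overlap] * fade_in
        out = np.concatenate([out, unit[overlap:]])
    return out
```

Because the fade-in and fade-out ramps sum to one, a constant signal passes through the joins unchanged, which keeps audible discontinuities at unit boundaries small.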

Architecture of the HNM coder

Last modified: January 16, 2012

© Idiap Research Institute 2012