F. Thomas Bruss
Human societies and survival : A Branching Process Model
Reference 8 Version 1 Date 09/09/2013

The use of speech in person-machine communication has been extensively studied during the past decades. This presentation is concerned with automatic speech recognition (ASR), i.e., the use of speech for controlling a machine. ASR has proven useful in various situations (telecommunications, assistance for the handicapped, hands-free operation, etc.). Commercial products have existed for more than 20 years, even though several important problems remain unsolved, in particular the lack of robustness of ASR systems in difficult conditions such as new speakers or the presence of noise.

The first step in the ASR process consists of analysing the speech signal in order to extract parameters pertinent for recognition. Frequency analysis based on the Fourier transform yields information about speech parameters such as formant frequencies. Most present ASR systems use MFCC (Mel Frequency Cepstral Coefficients) parameters, based on a cepstral analysis of the speech wave.
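As an illustration, the cepstral analysis underlying MFCC-style parameters can be sketched as follows. This is a toy version, not a production implementation: the filterbank uses equal-width frequency bins instead of a true mel scale, and the function name and parameter values are assumptions made for the example.

```python
import numpy as np

def toy_cepstral_coeffs(frame, n_bands=8, n_ceps=5):
    """Toy cepstral analysis: power spectrum -> filterbank -> log -> DCT.

    Real MFCC extraction would use mel-spaced triangular filters;
    here equal-width bins stand in for the filterbank.
    """
    # windowed power spectrum of one short frame of speech
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    # group spectrum bins into bands and take log energies
    edges = np.linspace(0, len(spec), n_bands + 1, dtype=int)
    energies = np.array([spec[a:b].sum() + 1e-10
                         for a, b in zip(edges[:-1], edges[1:])])
    log_e = np.log(energies)
    # DCT-II decorrelates the log energies -> cepstral coefficients
    n = np.arange(n_bands)
    return np.array([np.sum(log_e * np.cos(np.pi * k * (2 * n + 1) / (2 * n_bands)))
                     for k in range(n_ceps)])
```

In a full system such vectors would be computed every 10 ms or so over overlapping frames, often augmented with their time derivatives.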

Once the speech waveform has been parameterised, the vectors of parameters are used to recognise the word or sentence that has been pronounced. Dynamic programming techniques were used until the late 1970s to compensate for non-linear variations in the time structure of patterns. Now, virtually all systems use stochastic models, especially Hidden Markov Models (HMMs), to carry out the recognition process in a Bayesian framework. For continuous speech, language models are also used, in the form of n-gram models giving the probabilities of sequences of n words. The development of an ASR system thus implies a preliminary training phase during which the various conditional probabilities are learned. This phase requires collecting very large databases of labelled speech samples.
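The HMM decoding step described above can be sketched with the Viterbi algorithm, which finds the most likely hidden state sequence for an observation sequence; this minimal version assumes discrete observations, whereas real ASR systems use continuous densities over the parameter vectors (the variable names are illustrative).

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state path for a discrete-output HMM, in the log domain.

    obs: sequence of observation symbol indices
    pi:  initial state probabilities, shape (N,)
    A:   transition probabilities A[i, j] = P(j | i), shape (N, N)
    B:   emission probabilities B[i, k] = P(symbol k | state i)
    """
    T, N = len(obs), len(pi)
    log_delta = np.log(pi) + np.log(B[:, obs[0]])
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        # scores[i, j]: best path ending in i, then transition i -> j
        scores = log_delta[:, None] + np.log(A)
        back[t] = scores.argmax(axis=0)
        log_delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    # backtrack from the best final state
    path = [int(log_delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

In a recogniser, the transition and emission probabilities would come from the training phase mentioned above, and the language model scores would be combined with the acoustic scores during the search.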

Even though important progress has been made in recognition performance, present systems lack robustness: the recognition rate decreases dramatically when a system is not used under the conditions in which it was trained (noise level, recording conditions, speakers). Various methods have been proposed to address this problem: preprocessing of the speech signal (filtering, spectral subtraction, etc.), robust parameterisation, and adaptation of a system to new conditions.
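Of the preprocessing methods mentioned, spectral subtraction is simple enough to sketch: an estimate of the noise power spectrum is subtracted from each frame's power spectrum, with a floor to avoid negative power. This is a single-frame toy under the assumption that a noise estimate is available (e.g. from a speech-free segment); the function name and floor value are illustrative.

```python
import numpy as np

def spectral_subtraction(noisy, noise_est, floor=0.01):
    """Basic power spectral subtraction on one signal frame.

    noisy:     noisy speech samples
    noise_est: samples assumed to contain noise only
    floor:     fraction of the noisy power kept as a spectral floor
    """
    S = np.abs(np.fft.rfft(noisy)) ** 2          # noisy power spectrum
    N = np.abs(np.fft.rfft(noise_est)) ** 2      # estimated noise power spectrum
    clean_pow = np.maximum(S - N, floor * S)     # floor avoids negative power
    phase = np.angle(np.fft.rfft(noisy))         # reuse the noisy phase
    return np.fft.irfft(np.sqrt(clean_pow) * np.exp(1j * phase), n=len(noisy))
```

In practice the noise spectrum is re-estimated over time and the processing is applied frame by frame with overlap-add; the residual "musical noise" this method leaves is one reason robust parameterisation and model adaptation are studied as complements.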

An important research effort is still necessary to design efficient systems for advanced applications such as media or meeting transcription, or speech-to-speech translation. A pluridisciplinary approach is mandatory, both to collect and label speech databases and to model the large body of facts and knowledge about the speech production and perception processes.