Jean-Paul Haton
Reconnaissance automatique de la parole : passé, présent et futur (Automatic speech recognition: past, present and future)
Reference: 7 · Version: 1 · Date: 18/07/2013
Article
Introduction
The use of speech in person-machine communication has been studied extensively over the past decades. This presentation is concerned with automatic speech recognition (ASR), i.e., the use of speech to control a machine. ASR has proven useful in a variety of situations (telecommunications, assistance for people with disabilities, hands-free operation, etc.). Commercial products have existed for more than 20 years, even though several important problems remain unsolved, in particular the lack of robustness of ASR systems in difficult conditions such as new speakers or the presence of noise.

The first step in the ASR process consists in analysing the speech signal in order to extract parameters relevant for recognition. Frequency analysis based on the Fourier transform is useful for obtaining information about speech parameters such as formant frequencies. Most present ASR systems use MFCC (Mel Frequency Cepstral Coefficients) parameters, derived from a cepstral analysis of the speech wave.

Once the speech waveform has been parameterised, the vectors of parameters are used to recognise the word or sentence that has been pronounced. Dynamic programming techniques were used until the late 1970s to compensate for non-linear variations in the time structure of patterns. Today, virtually all systems use stochastic models, especially Hidden Markov Models (HMMs), to carry out the recognition process in a Bayesian framework. For continuous speech, language models are also used, in the form of n-gram models giving the probabilities of sequences of n words. The development of an ASR system thus implies a preliminary training phase during which the various conditional probabilities are learned. This phase requires the collection of very large databases of labelled speech samples.
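The dynamic programming technique mentioned above is classically known as dynamic time warping (DTW). A minimal sketch, assuming feature vectors simplified to single floats (real systems compare MFCC vectors with a vector distance):

```python
# Dynamic time warping (DTW): dynamic-programming alignment of a spoken
# word against a reference template, compensating for non-linear timing
# differences. Features are simplified here to 1-D floats.

def dtw_distance(a, b):
    """Return the minimal cumulative alignment cost between sequences a and b."""
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])              # local distance
            cost[i][j] = d + min(cost[i - 1][j],       # stretch a
                                 cost[i][j - 1],       # stretch b
                                 cost[i - 1][j - 1])   # diagonal match
    return cost[n][m]
```

In template-based recognition, the unknown utterance is compared against every stored word template and the template with the lowest DTW cost wins.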
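The n-gram probabilities learned during the training phase can be sketched in the simplest bigram case (n = 2), using maximum-likelihood estimates from counts; production systems add smoothing for unseen word pairs:

```python
# Toy bigram language model: estimates P(w2 | w1) from counts in a
# training corpus, illustrating how n-gram probabilities are learned.
from collections import Counter

def train_bigrams(sentences):
    """Count first-word occurrences and word-pair occurrences."""
    uni, bi = Counter(), Counter()
    for words in sentences:
        for w1, w2 in zip(words, words[1:]):
            uni[w1] += 1
            bi[(w1, w2)] += 1
    return uni, bi

def bigram_prob(uni, bi, w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1)."""
    return bi[(w1, w2)] / uni[w1] if uni[w1] else 0.0
```

During decoding, these probabilities are combined with the acoustic (HMM) scores to rank candidate word sequences.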
Even though important progress has been made in recognition performance, present systems lack robustness: the recognition rate decreases dramatically when a system is not used under the conditions in which it was trained (noise level, recording conditions, speakers). Various methods have been proposed to address this problem: preprocessing of the speech signal (filtering, spectral subtraction, etc.), robust parameterisation, and adaptation of a system to new conditions. A substantial research effort is still needed to design efficient systems for advanced applications such as media or meeting transcription, or speech-to-speech translation. A multidisciplinary approach is essential, both to collect and label speech databases and to model the large body of facts and knowledge about the speech production and perception processes.
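Of the preprocessing methods listed above, spectral subtraction is the simplest to illustrate. A minimal sketch, assuming a noise magnitude spectrum has already been estimated from non-speech frames:

```python
# Spectral subtraction: subtract an estimated noise magnitude spectrum
# from each frame's magnitude spectrum, bin by bin, flooring at zero
# so no bin ends up with negative energy.

def spectral_subtraction(frame_mag, noise_mag, floor=0.0):
    """Return the denoised magnitude spectrum of one frame."""
    return [max(s - n, floor) for s, n in zip(frame_mag, noise_mag)]
```

Real implementations apply this per short-time FFT frame and often use a small positive floor (or over-subtraction factors) to limit the "musical noise" artefacts the basic method produces.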
Authors
Jean-Paul Haton
Born 30 June 1943. Associate member of the Technology and Society Class; elected associate member on 3 March 2012. Computer scientist. Agrégé in physics (École Normale Supérieure de Saint-Cloud). Docteur ès Sciences Physiques. Professor emeritus at the Université de Lorraine. Doctor honoris causa of the University of Geneva. Member of the Institut Universitaire de France. Vice-president of the Académie Lorraine des Sciences. Auditor of the Institut des Hautes Études de Défense Nationale.
Categories
Physics
Versions

Article | Version | Date | Title
7 | 1 | 18/07/2013 | Reconnaissance automatique de la parole