[Figure: a sample of a speech signal]

Statistical Models for Speech Recognition   


Key Projects

We have devoted most of our research effort to the problem of modeling co-articulation phenomena in continuous speech.
*   Incorporating dynamic trends in HMM states
*   Articulatory feature-based speech units
*   Co-articulation modeling through data smoothing methods
*   Understanding variations in speech data
*   New classification methods for speech recognition


Incorporating dynamic trends in HMM states

The standard method of hidden Markov modeling (HMM) has been widely used for speech recognition since the 1970s. A hidden Markov model consists of the mathematical structure of a (hidden) Markov chain, with each state associated with a distinct independent and identically distributed (IID) or stationary random process. The model serves as a data generator for speech signals and approximates the nearly continuously varying speech signal in a piecewise-constant manner. Such an approximation can be reasonably good when each state is intended to represent only a short portion of a sonorant sound. However, since the acoustic patterns of continuously spoken speech are almost never stationary, it is desirable to improve this rather poor piecewise-constant approximation in general.
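To make the piecewise-constant behavior concrete, the following minimal sketch (with hypothetical transition and emission parameters, not taken from any of the work below) uses a stationary-state HMM as a data generator: each state emits IID Gaussian frames around a constant mean, so the synthetic feature sequence only changes level when the Markov chain changes state.

    # Minimal sketch, hypothetical parameters: a stationary-state HMM as a
    # data generator.  Each state emits IID Gaussian frames, so the output
    # is piecewise constant up to noise.
    import numpy as np

    rng = np.random.default_rng(0)

    A = np.array([[0.9, 0.1, 0.0],      # left-to-right transition matrix
                  [0.0, 0.9, 0.1],
                  [0.0, 0.0, 1.0]])
    means = np.array([0.0, 2.0, -1.0])  # one constant emission mean per state
    sigma = 0.3                         # shared emission standard deviation

    state, frames = 0, []
    for _ in range(60):
        frames.append(rng.normal(means[state], sigma))  # IID emission
        state = rng.choice(len(means), p=A[state])      # Markov transition

    # 'frames' hovers around a constant level within each state visit:
    # the piecewise-constant approximation discussed above.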


Hidden Markov Models with Non-stationary States

Don X. Sun
Li Deng (U. of Waterloo)

We propose, implement, and evaluate a class of non-stationary-state hidden Markov models (HMMs) in which each state is associated with a distinct polynomial regression function of time plus white Gaussian noise. The model represents the transitional acoustic trajectories of speech in a parametric manner and includes the standard stationary-state HMM as a special, degenerate case. We develop an efficient dynamic programming technique that includes the state sojourn time as an optimization variable, in conjunction with a state-dependent orthogonal polynomial regression method, for estimating the model parameters. Experiments on fitting models to speech data and on limited-vocabulary speech recognition demonstrate consistent superiority of these new non-stationary-state HMMs over the traditional stationary-state HMMs.

Paper: Postscript (621Kb)
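For contrast with the stationary-state sketch above, the following illustration (the trend coefficients and noise level are hypothetical, and this is not code from the paper) generates a single non-stationary state whose output is a polynomial regression function of time plus white Gaussian noise, then recovers the trend with a per-state polynomial least-squares fit; NumPy's rescaled polynomial basis stands in for the orthogonal polynomial regression described in the abstract.

    # Hedged illustration, not the authors' implementation: one
    # non-stationary HMM state emitting a polynomial trend in time plus
    # white Gaussian noise, followed by a least-squares fit of the trend.
    import numpy as np

    rng = np.random.default_rng(1)

    T = 25                              # state sojourn time, in frames
    t = np.arange(T)
    true_coeffs = [1.0, 0.15, -0.004]   # hypothetical trend c0 + c1*t + c2*t^2
    trend = np.polynomial.polynomial.polyval(t, true_coeffs)
    obs = trend + rng.normal(0.0, 0.2, size=T)   # trend + white Gaussian noise

    # Degree-2 regression estimate of the state's trajectory; Polynomial.fit
    # works in an internally rescaled (well-conditioned) basis, standing in
    # for the state-dependent orthogonal polynomial regression above.
    fit = np.polynomial.Polynomial.fit(t, obs, deg=2)
    print(fit.convert().coef)           # estimated trend coefficients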

Speech recognition using hidden Markov models with polynomial regression functions as non-stationary states

Li Deng (U. of Waterloo)
M. Aksmanovic
Don X. Sun
C.F.J. Wu

IEEE Transactions on Speech and Audio Processing, 2:4 (1994), pp. 507-520.


Articulatory Feature-Based Hidden Markov Models

We have been developing a feature-based general statistical framework for automatic speech recognition built on novel designs of minimal, or atomic, units of speech, aiming at a parsimonious scheme for sharing inter-word and inter-phone speech data and at a unified way of accounting for context-dependent behaviors in speech. The basic design philosophy is motivated by the theory of distinctive features (Chomsky and Halle, 1968; Stevens, 1986) and by a form of phonology that argues for the use of multi-dimensional articulatory structures (Browman and Goldstein, 1992). In this paper, we present the most recently developed feature-based recognizer, which is capable of operating on all classes of English sounds. We provide detailed descriptions of the design considerations for the recognizer and of key aspects of the design process. This process, which we call lexicon ``compilation'', consists of three elements. A standard phonetic classification task from the TIMIT database is used as a test-bed to evaluate the performance of the recognizer. The experimental results provide preliminary evidence for the effectiveness of our feature-based approach to speech recognition.


A Statistical Framework for Automatic Speech Recognition Using the Atomic Units Constructed From Overlapping Articulatory Features

Li Deng (U. of Waterloo)
Don X. Sun
Journal of the Acoustical Society of America, 95:5 (May 1994), pp. 2702-2719.

Paper: Postscript (249Kb)
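The following toy example illustrates the data-sharing idea behind feature-based units: phones are represented as bundles of articulatory feature values, and any dimensions on which two phones agree are candidates for shared (tied) acoustic units. The feature names and values here are invented for illustration and are not the feature inventory used in the paper.

    # Toy illustration only; the feature inventory below is made up.
    phone_features = {
        "b": {"lips": "closed", "tongue": "neutral",  "velum": "raised",  "voice": "+"},
        "m": {"lips": "closed", "tongue": "neutral",  "velum": "lowered", "voice": "+"},
        "d": {"lips": "open",   "tongue": "alveolar", "velum": "raised",  "voice": "+"},
    }

    def shared_features(p1, p2):
        """Feature dimensions whose values coincide: candidates for
        sharing atomic units between the two phones."""
        f1, f2 = phone_features[p1], phone_features[p2]
        return {k: v for k, v in f1.items() if f2[k] == v}

    print(shared_features("b", "m"))   # lip, tongue, and voicing features are shared
    print(shared_features("b", "d"))   # only velum state and voicing are shared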


Estimation of Spectral Trajectories

Don X. Sun

A new method is developed to estimate spectral center-of-gravity trajectories using robust statistical models with penalized weighted spline smoothers. Most existing methods for tracking speech formant trajectories are based on dynamic programming algorithms with certain continuity constraints on the formant frequencies. The objective functions (or loss functions) in these approaches are usually ad hoc and have very complex expressions that are difficult to optimize. In addition, many existing methods rely on the accuracy of the LPC spectral peaks and are not very robust against missing or spurious peaks. Instead of using the peaks of the LPC spectral functions, we propose a new approach to estimating the ``centers of gravity'' in the spectrogram using mixture models of spline smoothers.

Paper: Postscript (203Kb)
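As a simplified sketch of the general idea (a single weighted smoothing spline rather than the robust mixture of spline smoothers described above), the code below computes a raw per-frame spectral center of gravity from a magnitude spectrogram and smooths it across time, down-weighting low-energy frames; the function and variable names are ours, not the paper's.

    # Simplified sketch: weighted smoothing spline on the per-frame
    # spectral center of gravity (not the paper's mixture-of-smoothers model).
    import numpy as np
    from scipy.interpolate import UnivariateSpline

    def cog_trajectory(spec, freqs, smooth=None):
        """spec: (frames, bins) magnitude spectrogram; freqs: bin center
        frequencies in Hz.  Returns a smoothed center-of-gravity track."""
        energy = spec.sum(axis=1) + 1e-12
        cog = (spec * freqs).sum(axis=1) / energy        # raw per-frame COG
        t = np.arange(len(cog))
        w = energy / energy.max()                        # down-weight weak frames
        spline = UnivariateSpline(t, cog, w=w, s=smooth) # penalized weighted smoother
        return spline(t)

    # Usage with synthetic data, just to show the shapes involved.
    rng = np.random.default_rng(2)
    freqs = np.linspace(0.0, 4000.0, 64)
    spec = rng.random((100, 64))
    print(cog_trajectory(spec, freqs)[:5])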


Analysis of Acoustic-Phonetic Variations in Speech

Don X. Sun
Li Deng (U. of Waterloo)

We applied a hierarchically structured analysis of variance (ANOVA) method to analyze, in a quantitative manner, the contributions of various identifiable factors to the overall acoustic variability exhibited in the fluent speech data of TIMIT, processed in the form of Mel-frequency cepstral coefficients. The results of the analysis show that the greatest acoustic variability in the TIMIT data is explained by differences among distinct TIMIT phonetic labels, followed by differences in phonetic context given a fixed phonetic label. The variability among sequential sub-segments within each TIMIT-defined phonetic segment is found to be significantly greater than that due to the gender, dialect region, and speaker factors.

Paper: Postscript (100Kb)
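The following schematic sketch shows the kind of variance decomposition involved, reduced to a single cepstral coefficient and one grouping factor at a time (it is not the nested, hierarchically structured ANOVA of the paper, and the numbers are invented): the between-group sum of squares is divided by the total sum of squares to give the fraction of variability a factor explains.

    # Schematic one-way variance decomposition; data and labels are invented.
    import numpy as np

    def variance_explained(values, labels):
        """Between-group sum of squares divided by total sum of squares."""
        values, labels = np.asarray(values, float), np.asarray(labels)
        grand = values.mean()
        total = ((values - grand) ** 2).sum()
        between = sum(
            (labels == g).sum() * (values[labels == g].mean() - grand) ** 2
            for g in np.unique(labels)
        )
        return between / total

    # One hypothetical MFCC coefficient with phone and speaker labels per frame.
    c1    = [1.2, 1.1, 3.4, 3.6, 0.9, 3.5]
    phone = ["aa", "aa", "iy", "iy", "aa", "iy"]
    spkr  = ["s1", "s2", "s1", "s2", "s1", "s2"]
    print(variance_explained(c1, phone))  # large: phone label explains most variance
    print(variance_explained(c1, spkr))   # small: speaker explains little here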


A Support Vector/Hidden Markov Model Approach to Phoneme Recognition

Steven E. Golowich
Don X. Sun

A novel method for classifying frames of speech waveforms to a given set of phoneme classes is proposed. The method involves combining an approximation to multiple smoothing spline logistic regression (known as the ``Support Vector Machine'' in the machine learning literature) with hidden Markov models (HMMs). The method is compared with the standard technique in the speech recognition literature, that of HMMs with Gaussian mixture models. Both models were trained and tested using data drawn from the publicly available TIMIT database. Our results show that the two types of models are competitive for this data, but have very different structures. Such differences can be used to improve recognition rates by combining the two types of classifiers.

Paper: PDF (83Kb)
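As a hedged sketch of the hybrid idea (not the system described in the paper, and with synthetic two-dimensional "frames" in place of real speech features), the code below trains an SVM to produce per-frame class posteriors and then runs a Viterbi pass over an HMM-style transition matrix to smooth them into a label sequence.

    # Hedged sketch of an SVM/HMM hybrid on synthetic data (not the paper's system).
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(3)

    # Hypothetical training frames for two phoneme classes.
    X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
    y_train = np.array([0] * 50 + [1] * 50)
    svm = SVC(kernel="rbf", probability=True).fit(X_train, y_train)

    # Per-frame log posteriors for a synthetic test "utterance".
    X_test = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(3, 1, (10, 2))])
    logp = np.log(svm.predict_proba(X_test) + 1e-12)    # (frames, classes)

    # Viterbi decoding with a sticky transition matrix: labels tend to persist.
    logA = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
    delta, back = logp[0].copy(), []
    for obs in logp[1:]:
        scores = delta[:, None] + logA                   # scores[i, j]: prev i -> next j
        back.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + obs
    path = [int(delta.argmax())]
    for b in reversed(back):
        path.append(int(b[path[-1]]))
    print(path[::-1])                                    # smoothed frame labels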


Last modified: 2000/11/02 21:14:27

dxsun@research.bell-labs.com