Research Projects in the Speech and Signal Processing Lab
The fundamental goal of this research project is to develop a broadly useable framework for pattern analysis and classification of animal vocalizations, by integrating successful models and ideas from the field of speech processing and recognition into bioacoustics. Tasks include automatic vocalization classification and labeling, individual identification, call type classification, behavioral vocalization correlations, language acquisition, and seismic infrasonic communication. Species being targeted for study include domestic and agricultural animals, marine mammals, and several endangered species, in collaboration with researchers at a number of other institutions.
Speech Recognition using Dynamical Systems (MU Speech Lab and Knowledge in Information and Discovery (KID) Lab)
This research project focuses on applying state-of-the-art techniques for time-series modeling to the problem of characterizing speech signals. These time-series techniques combine state-space embedding methods and learning algorithms to create highly accurate non-linear models of a system's state. The time-delay embedding technique, taken from dynamical systems theory, is used to reconstruct the state spaces of the speech waveforms, which are characterized statistically and used to differentiate individual phonemes for isolated and continuous speech recognition.
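As a rough illustration of the time-delay embedding step described above, the following sketch builds delay vectors from a sampled waveform. The function name, the embedding dimension, and the lag are illustrative assumptions, not values taken from the project.

```python
import math


def delay_embed(signal, dim, lag):
    """Return delay vectors (x[n], x[n+lag], ..., x[n+(dim-1)*lag]).

    dim (embedding dimension) and lag (delay in samples) are free
    parameters chosen here for illustration only.
    """
    if dim < 1 or lag < 1:
        raise ValueError("dim and lag must be positive integers")
    n_points = len(signal) - (dim - 1) * lag
    if n_points <= 0:
        return []
    return [
        tuple(signal[n + k * lag] for k in range(dim))
        for n in range(n_points)
    ]


# Toy waveform standing in for a speech frame (hypothetical data).
wave = [math.sin(2 * math.pi * n / 16) for n in range(64)]
points = delay_embed(wave, dim=3, lag=4)
```

Each tuple in `points` is one point in the reconstructed space; a statistical model such as a Gaussian mixture could then be fitted to the point cloud of each phoneme.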
Speech Recognition Using Time Domain Features from Phase Space Reconstructions
Jinjin Ye, Dr. Mike Johnson
This project, a part of the project on Speech Recognition using Dynamical Systems described above, focuses on an analysis of phoneme variability in reconstructed phase spaces and on development of new features computed from these spaces for use in speech recognition.
Speech Recognition using Phase Space Reconstructions
Andrew Lindgren, Dr. Mike Johnson
This project, a part of the project on Speech Recognition using Dynamical Systems described above, focuses on measuring the performance of Gaussian Mixture Models of reconstructed phase spaces for isolated phoneme recognition, both alone and in combination with a standard MFCC feature representation. Results show that combining the two methods leads to an increase in accuracy over that of MFCCs alone.
Speech Signal Enhancement Using Beamforming and Formant Enhancement with a four microphone array
Heather Ewalt, Dr. Mike Johnson
Speech enhancement is very difficult in multiple-speaker environments, due to the interference among the various sources (this is sometimes called the "cocktail party" effect). Yet this task is essential in the development of new hearing aid technologies as well as many other speech quality and intelligibility improvement applications. To address this problem, we are combining the use of beamforming and formant-based speech enhancement algorithms in the context of a microphone array input system. The primary source (here defined as the closest source, not necessarily the loudest) and secondary sources will be identified using source location techniques, extracted using a beamforming algorithm to perform spatial filtering, and enhanced further using a multiple signal formant enhancement method based on formant tracking and enhancement.
Investigating New Models for Stochastic Speech Recognition Using Trajectory Modeling
Franck Hounkpevi, Dr. Mike Johnson
We present new approaches to the speech recognition problem. Speech recognition is performed by selecting the string of words that best matches the acoustic speech (the acoustic observation). Speech can be seen as a sequence of observations that vary with articulatory changes. In a Hidden Markov Model (HMM), the most widely used model for speech recognition, first-order Markov chains model the variability of a phoneme unit through state transition probabilities (global non-stationarity) and the variability of observations within a state through a state-conditioned, time-invariant distribution (local stationarity). Observations within a state are assumed to be independent, which leaves the acoustic observation free to follow any trajectory (path) in the observation space. This does not reflect reality: speech is a continuously time-varying signal and follows a certain trajectory for a given phoneme. Moreover, because we want to rely on the HMM for the global non-stationarity, the size of a state (speech unit) makes the local stationarity assumption false. Hence the need for models that accommodate the sub-variation of speech and the trajectories within a phone. Such trajectories reflect the dependency between successive observations. We present both models that use a single trajectory for a phoneme (Segment Models and Trended HMMs) and models that associate several different trajectories (a cluster of trajectories) with a single phoneme (or any other speech unit). Our research investigates new models that best represent the trajectory of speech, both in terms of time variability (number of points and duration) and in terms of correlation between successive observations.
Speech Enhancement using Bionic Wavelet Denoising
Xiaolong Yuan, Dr. Mike Johnson
In this thesis, the Bionic Wavelet Transform, a new type of wavelet modeled after the nonlinearities of the human auditory system, is used to implement enhancement on noisy speech signals. Several different coefficient thresholding techniques are investigated, with the result that both SNR and SSNR improvement are measurably better using the Bionic wavelet transform as compared to a standard wavelet transform.
Voice Authentication Security System in Speech and Signal Processing Laboratory
Senior Design Team E-13, Speech & Signal Processing Group
Biometrics, the use of individually identifying characteristics such as fingerprints, iris and retina scans, and voiceprints, is a rapidly growing area of research. The Speech and Signal Processing lab, in collaboration with an undergraduate senior design team, is currently designing and implementing a system that uses speaker identification and verification algorithms to control lab access. Users will be able to approach the door, say a password or pass phrase, and be provided entry through an automated door lock mechanism after their identity is confirmed. Key aspects of the project include the design of proximity sensor circuitry and real-time data acquisition tools, implementation of the speaker verification algorithms and software security protocols, and testing and evaluation procedures for measuring system accuracy. The system will provide a platform for research and testing of improved verification algorithms, based on speech recognition technologies such as Dynamic Time Warping, Gaussian Mixture Models, and Hidden Markov Models.
Incorporating Language Structure and Prosodic Knowledge into Speech Recognition Systems
Dr. Mike Johnson
Many of the fundamental issues in the field of speech recognition revolve around methods of incorporating additional knowledge sources, beyond the basic spectral information of the speech signal, into the recognition process. These knowledge sources, which include information about prosody, grammatical structure, and semantics, are difficult to quantify with regard to the goal of language understanding, and are even more difficult to interface with the somewhat rigid structures, such as Hidden Markov Models (HMMs), commonly used in acoustic processing. Our research in this area is focused on developing ways of incorporating language models and prosodic information into HMM-based recognition systems.
Dynamical Systems Analysis of Kinematic Speech Data
Dr. Mike Johnson
Study of the physiological development of speech articulator movements is important to our understanding of a variety of speech dysfunctions. In collaboration with the Purdue University Speech Lab (Department of Audiology and Speech Sciences), we are using dynamical systems analysis, specifically chaotic attractor classification, to characterize the movement of speech articulators across age groups. Using a state-of-the-art Optotrak system to record three dimensional movement, the Purdue Speech Lab is able to track the movement of articulators, specifically the upper lip, lower lip, and jaw, on children and young adults. Sentences such as ‘Buy Bobby a puppy.’ and ‘Mommy bakes pot pies’ are rich in voiced and unvoiced labial plosives, and the movement of the lips during several such utterances can be analyzed to give information about the degree of movement consistency of children at various ages. Here at Marquette, we are developing a set of algorithms based on dynamical systems theory which will enable us to characterize speech signals and generate a measure of difference between speech production systems.
Formant Tracking of Noise-Corrupted Speech Signals Based on Auditory Modeling
Troy L. Mack
In this dissertation, a new class of formant tracking algorithms is proposed based on the use of an auditory model. The particular auditory model employed was proposed by Ghitza (1986, 1988, 1992, 1993, 1994). It consists of various mathematical stages that mimic the performance of the human auditory system. Since humans are rather good at understanding speech, even to some extent in noisy situations, it is expected that a formant tracker based on an auditory model may offer the potential of outperforming other formant trackers in the presence of high levels of noise. Various aspects of the auditory model are studied to determine what combination of model features and parameters are most useful in extracting formant information from a speech signal. This formant tracker is evaluated and compared (in terms of percent missed formants and root mean square error of formant frequencies) to two other standard formant trackers on a database of noise-corrupted speech utterances for which accurate formant information is known. The auditory formant tracker is shown to outperform the other formant trackers in high noise situations, especially for the first formant and for male speakers. Informal listening tests in high noise situations employing an existing formant-based processing system also suggest improved performance of the auditory formant tracker over the other formant trackers considered.
Improved Chebyshev Design of Linear-Phase Finite-Impulse-Response Digital Filters
Jian Sun
In this thesis, four new modifications of the Parks-McClellan algorithm are proposed in an attempt to improve this algorithm in various ways. MATLAB programs were developed to realize these new algorithms. A graphical user interface called gui was also developed. The purpose of this program is to help users conveniently design and compare filters. It combines four filter design programs (standard remez, remez la, remez2 and remez9), enabling users to design an optimal Chebyshev filter with four different methods and compare the results easily and quickly.
A New Technique for Blind Reconstruction of Stochastic Processes with Applications to Speech Enhancement
Egide A. V. Houndegla
This dissertation presents a new scheme for the reconstruction of an unobserved signal, which has been degraded by additive noise and convolutional noise. The proposed technique is a combination of cumulant-based blind deconvolution algorithms of different orders. It makes use of the order selection optimization and the de-noising ability of low-rank modeling theory. In addition, unlike most of the iterative blind deconvolution algorithms previously presented in the literature, the proposed algorithm possesses a convergence criterion. Due to its unique features, which are the use of a combination of several statistics of different order, its de-noising ability and its convergence criterion, the proposed algorithm is very attractive for a wide range of applications. The proposed algorithm with minor changes is combined with a proposed wavelet-based de-noising algorithm to implement a speech enhancement algorithm. This proposed technique improves speech quality by removing additive noise and, unlike most of the speech enhancement algorithms proposed in the literature, convolutional noise as well.
Feature-Based Speech Enhancement Techniques Based on Spectral Subtraction and Wiener Filtering
Mike V. Chan
This dissertation presents four new feature-based speech enhancement techniques and demonstrates (both objectively and, in some cases, subjectively) their improvement over the existing methods. These new techniques include feature-based spectral subtraction, feature-based Wiener filtering, iterative feature-based Wiener filtering and constrained iterative feature-based Wiener filtering. In addition, this dissertation addresses two important speech enhancement issues. The first is the usage and limitation of line-spectrum frequencies in speech enhancement. It is shown in this dissertation that with decreasing signal-to-noise ratios the line-spectrum frequencies converge to a predictable set of values, determined by the order of estimation, corresponding to the pure noise case. Heuristically, this study also provides a range of signal-to-noise ratios in which meaningful speech information can be retrieved, and a range of signal-to-noise ratios in which no processing is necessary. The second study involves the extension of these techniques to include a termination criterion. The results indicate that in most cases, these self-terminating techniques combined with the existing iterative processes perform better than simply using a fixed number of iterations.
A Fuzzy Syntactic Approach to Fault Diagnostics by Analysis of Time Sampled Signals
M. Borahan Turner
In this work, we propose a new fuzzy syntactic approach to automated diagnosis. This approach combines the decision-theoretic approach with the syntactic approach. Time-sampled input signals generated by the system under analysis are transformed into a sequence of templates by the decision-theoretic part of the approach. Then, the syntax of the template sequence is analyzed in the syntactical part of the approach. The syntactic analysis is achieved using fuzziness which adds flexibility to the syntactic approach to handle noisy and imperfect information.
The Application of Moment and Cumulant Spectra to Formant Tracking of Speech Embedded in Noise
Lane Branson
The primary focus of this dissertation is the problem of formant tracking of speech embedded in noise. Accordingly, the objective of this work is a formant tracking algorithm that is a significant improvement over existing algorithms. A rather broad approach based upon moments and cumulants was adopted. The rationalization for this approach resulted from studies of autocorrelation which demonstrated clear signal peak enhancement in the frequency domain. Further, an initial literature search revealed that higher-order spectra are valuable tools, particularly in spectral analysis of signals embedded in noise. In view of the foregoing, to find the most effective formant tracking algorithm for speech in noise, it was necessary to investigate, in the context of moments and cumulants, classical (Fourier based) spectral estimation methods.
A Noise Robust Method for Detection of Endpoints of Speech Utterances
Richard J. Santiago
With the widespread usage of voice-activated machinery and telecommunications devices, an important technological challenge is the ability to detect human speech by machines. This problem has been the focus of much research effort since the early seventies. A fundamental mechanism which is required by many speech systems is an endpoint detector. An endpoint detector attempts to identify the points in time at which a continuous speech utterance starts and stops. This thesis proposes one such endpoint detector which works extremely well in noisy environments.
A Spectral Subtraction Method for the Enhancement of Speech Corrupted by Non-White, Non-Stationary Noise
Scott M. McOlash
Spectral subtraction is a popular method for the enhancement of the quality of speech corrupted by additive noise. Implementations of spectral subtraction require an available estimate of the corrupting noise. The spectrum of the noise is usually estimated during a period of time known a priori to be speech free. This estimate is then assumed to remain stationary over the entire noisy speech signal. The approach pursued in this thesis makes use of a standard spectral subtraction algorithm. However, the method does not require a noise estimate obtained from a period of time when speech is known not to exist. Instead, use is made of a continuously running noise estimation algorithm to track the noise in the signal which is input to the spectral subtraction process. As a result, the method is novel in that it (1) does not require a known non-speech interval from which to determine the noise, and (2) can handle both non-white and slowly varying (relative to the speech) noise in an automatic way. Speech features which are used to estimate the noise content during speech are the voiced/unvoiced decision, pitch frequency estimate and the confidence of these features. Results show that the quality of speech degraded by non-white, non-stationary noise can be improved using spectral subtraction with the proposed noise estimation algorithm.
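The two ingredients described above, a continuously updated noise estimate and the subtraction itself, can be sketched as follows. This is a generic magnitude-domain illustration, not the thesis algorithm: it assumes each frame has already been converted to a magnitude spectrum, and the smoothing factor `alpha` and spectral floor `floor` are hypothetical parameters. The feature-driven logic (voiced/unvoiced decision, pitch, confidence) that decides how to update the noise estimate is not reproduced here.

```python
def update_noise_estimate(noise_mag, frame_mag, alpha=0.9):
    """Recursively average the noise magnitude spectrum.

    alpha close to 1 tracks slowly varying noise; the value is an
    illustrative assumption.
    """
    return [alpha * n + (1.0 - alpha) * y
            for n, y in zip(noise_mag, frame_mag)]


def spectral_subtract(frame_mag, noise_mag, floor=0.02):
    """Subtract the noise magnitude from the noisy magnitude, bin by bin.

    Results below floor * |Y| are clamped so that no bin becomes
    negative; the floor value is an illustrative assumption.
    """
    return [max(y - n, floor * y) for y, n in zip(frame_mag, noise_mag)]


# Hypothetical three-bin magnitude spectra.
noisy = [1.0, 0.5, 0.1]
noise = [0.2, 0.2, 0.2]
enhanced = spectral_subtract(noisy, noise)
tracked = update_noise_estimate(noise, noisy, alpha=0.5)
```

In a complete system the enhanced magnitudes would be recombined with the noisy phase and inverse-transformed to produce the output frame.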
A New Method for Designing FIR Digital Filters with Low Coefficient Sensitivity
Egide V. Houndegla
A new method of designing FIR digital filters that generalizes existing cascaded FIR prefilter-equalizer methods is presented. The proposed prefilter is a parallel connection of the so-called recursive running sum and a very low order FIR filter and has an impressive passband performance. The equalizer, an FIR filter, is designed via Chan’s method for designing FIR filters. The method can be used to implement most practical FIR filters and gives smaller coefficient sensitivity. For a certain class of filter design problems, this method provides a significant reduction in the total number of bits used for the multiplier coefficients. An attractive feature of the proposed method is its applicability to unequally spaced samples due to Chan’s technique.
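For readers unfamiliar with the recursive running sum mentioned above, the sketch below shows the standard recursion y[n] = y[n-1] + x[n] - x[n-N], which computes a length-N moving sum without any multiplier coefficients. It illustrates only this one building block under the usual textbook definition; the parallel low-order FIR branch and the equalizer design are not shown, and the function name is hypothetical.

```python
def recursive_running_sum(x, length):
    """Length-N moving sum via y[n] = y[n-1] + x[n] - x[n-N].

    Samples before the start of x are treated as zero.
    """
    if length < 1:
        raise ValueError("length must be a positive integer")
    y = []
    acc = 0
    for n, sample in enumerate(x):
        acc += sample                 # add the newest sample
        if n >= length:
            acc -= x[n - length]      # drop the sample leaving the window
        y.append(acc)
    return y


sums = recursive_running_sum([1, 2, 3, 4, 5], 3)
```

Because the recursion uses only additions and subtractions, it needs no coefficient storage, which is one reason such structures are attractive as prefilters.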
New Design Formulas for FIR Filters with Arbitrary Shapes
Mike V. Chan
New formulas are developed for the impulse response of zero-phase digital filters with arbitrary shapes. Using standard windowing techniques, inverse Fourier transforms, and numerical integration techniques, these formulas provide a simple basis for designing linear-phase FIR digital filters with arbitrary magnitude responses. Both general and specific cases are considered and illustrated with examples. The similarities and differences between the new techniques and their implementation results are also discussed. Finally, applications of the new formulas to speech processing are considered.
Feature Based Speech Intelligibility Enhancement in High Noise Levels
Robert J. Conway
In many situations where speech is used as a means of transmitting information, the presence of background noise has the effect of reducing the intelligibility of the speech. Often, the speech signal becomes corrupted to the point where it is no longer understandable and the intended message is lost. It is desirable in these situations to have a means of enhancing the noise-corrupted speech so that its intelligibility is restored. Past and current research has indicated that conventional signal restoration techniques are not adequate for this purpose. Therefore, four techniques which exploit the unique properties of speech are proposed. The techniques utilize information about features extracted from the noise-corrupted speech. These features are generally accepted as important factors affecting the intelligibility of speech. The proposed techniques emphasize these features in the noise-corrupted speech signal with the goal of enhancing its intelligibility.
A Time Warping Digital FIR Filter for Nuclear Magnetic Resonance Echoes Collected with Time Varying Readout Gradients
Steven Robert Wedan
The goal of this thesis is to investigate the above mentioned problem in detail and to propose a modification which, when performed on a current MR digital receiver, will transform the MR system from one which requires flat top readout gradients to one which allows the readout gradient to assume any predefined function. With such an MR system, the readout gradient may be a sinusoid, in which case the bandwidth of the gradient generating subsystem is minimized. The proposed modification is to replace a standard digital finite impulse response (FIR) filter with a new type of digital FIR filter that uses a warped time axis in the development of its coefficients and allows for these coefficients to change with every desired output point. Furthermore, the filter is specified such that the output sampling period (the time between output points) may be time varying and not necessarily a multiple of the input sampling frequency. We call this filter a time warping digital FIR (TWE) filter.
Transient Reduction for Digital Wall Filters in Doppler Ultrasound
Scott Otterson
In medical ultrasound, the Doppler shift of a segment of acoustic pulses transmitted into the body is used to measure blood velocity. Since tissue and bone are much more echogenic than blood, the signal from these two components, collectively called the wall signal, must be removed before the blood signal can be processed for spectral display. Blood usually moves at higher velocities than tissue, so its contribution to the received echo is in the higher frequencies; the wall signal is often removed with an analog high pass filter. With the wall signal removed, segments of Doppler data are analyzed with an FFT and displayed in a real-time, waterfall format. In newer, time multiplexing ultrasound machines, Doppler data segments are interrupted by data segments used for other types of imaging. This presents a challenge to wall filter implementation; the discontinuity at the beginning of each Doppler segment causes the wall filter to ring, polluting the Doppler data. On current ultrasound machines, the solution has been to wait for the filter transients to die down before doing Doppler spectral analysis. During transient die-down, up to 2 ms of information is lost, decreasing the time precision of both imaging and blood velocity measurements. The subject of this thesis is the lessening of the effect of the wall filter ring, particularly in the case when the ultrasound data arrives in a digital format which is processed with a digital filter.
The Effects of Transmission Source Speed and Design of Emission Image Quality Using Local Statistical Noise Estimation
Karen J. Leaf Lensmire
Positron Emission Tomography (PET) is a medical imaging technique based on the detection of gamma radiation. Noise in PET is primarily due to low counting statistics in the acquired data. Several corrections must be applied to the raw data to form quantitatively accurate images; these corrections should add as little noise as possible to the images. The correction for the attenuation of photons through the subject contributes more noise potential than other corrections. This thesis demonstrates a new method for improving statistical uniformity of image data through improving the attenuation correction technique by varying the speed of an orbiting rod source during the transmission scan to match the subject’s shape as positioned in the imaging field of view. This thesis demonstrates this improvement by extending the concept of noise equivalent counts through reconstruction of all data elements.
Evaluation of Methods for Approximating the Short-Time Energy Contour of Speech in Noise Based on Intelligibility Tests
Steven A. Dimino
The goal of the work described in this thesis is to determine the effect that adjusting the short-time energy contour of corrupted speech to match that of the uncorrupted speech has on the intelligibility of the resulting speech. While the effect of using the short-time energy contour measured from the uncorrupted speech is of interest for comparison's sake, this signal is never available in any realistic situation; hence, a short-time energy contour estimate must be obtained in some way from the corrupted speech. To this end, various methods for estimating the short-time energy contour of speech signals in high levels of background noise are presented. These methods are evaluated on the basis of both subjective and objective measures. The objective measure consists of the mean-squared error comparison between the short-time energy contour estimated from the noisy signal and the short-time energy contour measured from the uncorrupted signal. The subjective measure consists of the Diagnostic Rhyme Test, which is designed to measure the effects a processing scheme has on the intelligibility of the processed speech. Both measures are included since a high level of correlation between objective distance measures and subjective intelligibility scores has not been established in the literature. As an appendix to the thesis, a discussion of various modern spectral estimation techniques is presented with emphasis on the techniques used for identifying frequency components when the signal is embedded in high level noise.
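The objective measure described above can be sketched in a few lines. The frame length, hop size, and function names below are illustrative assumptions; the thesis does not state them here, and its estimation methods for noisy speech are not reproduced.

```python
def short_time_energy(signal, frame_len, hop):
    """Sum of squared samples in each frame of frame_len samples,
    with frames advancing by hop samples."""
    if frame_len < 1 or hop < 1:
        raise ValueError("frame_len and hop must be positive integers")
    return [
        sum(x * x for x in signal[start:start + frame_len])
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


def mean_squared_error(contour_a, contour_b):
    """Mean-squared error between two equal-length energy contours."""
    if len(contour_a) != len(contour_b) or not contour_a:
        raise ValueError("contours must be non-empty and of equal length")
    return sum((a - b) ** 2
               for a, b in zip(contour_a, contour_b)) / len(contour_a)


# Hypothetical clean signal and a noisy version of it.
clean = [1, 2, 3, 4]
noisy = [1, 2, 3, 5]
clean_contour = short_time_energy(clean, frame_len=2, hop=2)
noisy_contour = short_time_energy(noisy, frame_len=2, hop=2)
error = mean_squared_error(clean_contour, noisy_contour)
```

A smaller `error` means the estimated contour lies closer to the one measured from the uncorrupted signal; as the abstract notes, this does not by itself guarantee higher intelligibility.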
Talk-through for a Telephone Voice Recognition Demonstration System
Daniel J. Sebald
This research stems from work done by researchers at AT&T Bell Laboratories who employed echo cancellation, a common telephone signal processing technology utilizing an adaptive filter, at the front end of an automated speech recognizer. The feasibility of this method is investigated and, based upon drawbacks of the method, a modified version of echo cancellation for improved operation is proposed. The new algorithm, called the modified least-mean-square or MLMS algorithm, introduces delay into the updating of filter coefficients to avoid a condition known as divergence, i.e., maladjustment of coefficients, when both prompting messages and customer speech are present. Also presented in this thesis is a new way of measuring the performance of a speech recognizer based upon concepts from communications and information theory. The measure, called the mutual information to entropy ratio, is shown to have some advantages over conventional statistical measures of performance. Specifically, it incorporates all possible outcomes of a recognition process (correct, deleted, inserted, and substituted) into one measure.
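For context, the conventional least-mean-square echo canceller that the MLMS algorithm modifies can be sketched as below. This is the standard LMS update only; the delayed coefficient update that defines MLMS is not reproduced, since its details are not given here. The echo path, step size `mu`, and signal lengths are hypothetical values chosen so the example converges.

```python
import random


def lms_echo_canceller(far_end, mic, num_taps, mu):
    """Standard LMS adaptive filter.

    far_end: prompt signal sent to the line (filter input).
    mic: returned signal containing the echo (desired signal).
    Returns the final weights and the residual (echo-cancelled) signal.
    """
    weights = [0.0] * num_taps
    history = [0.0] * num_taps            # newest far-end sample first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - echo_estimate             # what remains after cancellation
        weights = [w + mu * e * h for w, h in zip(weights, history)]
        residual.append(e)
    return weights, residual


# Hypothetical two-tap echo path and random prompt signal (no customer speech).
rng = random.Random(0)
prompt = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
echo_path = [0.5, -0.3]
mic = [
    echo_path[0] * prompt[n] + (echo_path[1] * prompt[n - 1] if n > 0 else 0.0)
    for n in range(len(prompt))
]
weights, residual = lms_echo_canceller(prompt, mic, num_taps=2, mu=0.1)
```

With only the prompt present, the weights settle on the echo path. When customer speech is added to `mic`, the error term `e` also contains that speech, which can pull the weights away from the echo path; this is the divergence condition the abstract refers to.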
Formant Tracking to Improve the Intelligibility of Noise Corrupted Speech
Thomas J. Svoren
In speech processing, one of the important applications is improving the intelligibility of speech at low signal-to-noise ratios. The approach being pursued at Marquette is to enhance speech using extracted features. These extracted features include the pitch frequency, the noise spectrum, the energy contour, the voiced/unvoiced decision, and the formant frequencies. For the purpose of improving speech intelligibility, the formant frequencies are crucial. It has been shown that an improvement of +9 decibels (dB) in the speech from which the formant frequencies are extracted results in a substantial improvement in intelligibility in a cepstral based speech enhancement algorithm. This thesis documents an examination of different formant tracking algorithms to achieve a more robust speech signal at low signal-to-noise ratios (SNRs). Quantitative results such as RMS error, missed formants within a frequency range around the correct formant, and extraneous formants are considered. The Diagnostic Rhyme Test (DRT) is used to provide qualitative results to determine effects on speech intelligibility.
Speech Intelligibility Testing Using Subjective and Objective Methods
Teresa M. Sippel
In this thesis two speech intelligibility testing methods are presented. The first, a subjective test, is a variation of the Diagnostic Rhyme Test (DRT) as described by William Voiers. The second, an objective test, consists of an implementation of two speech distance measures. Existing subjective and objective speech intelligibility tests are also discussed and compared. The Diagnostic Rhyme Test (DRT) is discussed in detail and its implementation and results are described. The Cepstral and Log Likelihood speech distance measures are discussed and their implementation and results are described. Finally, the results of the DRT intelligibility scores are compared and contrasted with the distance measure results for various signal to noise ratios.