Journal of the Audio Engineering Society

2017 April - Volume 65 Number 4


Single-Channel Speech Enhancement Based on Psychoacoustic Masking

Authors: Zhou, Tingting; Zeng, Yumin; Wang, Rongrong

Speech enhancement processing can improve the performance of speech communication systems in noisy environments, such as in mobile communication systems, speech recognition, or hearing aids. Single-channel speech enhancement is more difficult than multichannel enhancement because there is no independent source of information that can help separate the speech and noise signals. This paper addresses single-channel speech enhancement based on the masking properties of the human auditory system. A complete implementation of speech enhancement using psychoacoustic masking is presented. The incorporation of temporal masking along with simultaneous masking (as compared to using only simultaneous masking) produces results that are more consistent with human auditory characteristics. The combined masking is then used to adapt the subtraction parameters to obtain the best trade-off among noise reduction, speech distortion, and the level of residual perceptual noise. Objective measures and subjective listening tests demonstrate that the proposed algorithm outperforms comparable speech enhancement algorithms.
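As a rough illustration of how a masking threshold can adapt subtraction parameters, the sketch below applies a generic masking-adapted spectral subtraction to one frame of power spectra. It is not the authors' implementation: the function names, the linear interpolation rule, the parameter ranges, and the spectral floor are all assumptions made for illustration, and the per-bin masking threshold (which the paper derives from combined temporal and simultaneous masking) is simply taken as an input.

```python
def adapt_parameter(threshold, t_min, t_max, p_min, p_max):
    """Map a masking threshold to a subtraction parameter (hypothetical rule).

    A low threshold means residual noise is audible, so the parameter moves
    toward p_max; a high threshold means noise is masked, so it moves
    toward p_min. Linear interpolation is an assumption of this sketch.
    """
    if t_max == t_min:
        return p_max
    frac = (t_max - threshold) / (t_max - t_min)
    frac = min(1.0, max(0.0, frac))  # clamp to [0, 1]
    return p_min + frac * (p_max - p_min)


def enhance_frame(noisy_power, noise_power, masking_threshold,
                  alpha_range=(1.0, 6.0), beta_range=(0.0, 0.02)):
    """Masking-adapted power spectral subtraction for a single frame.

    noisy_power, noise_power, masking_threshold: per-bin sequences of
    equal length. alpha is the over-subtraction factor and beta the
    spectral-floor factor; both ranges are illustrative assumptions.
    """
    t_min, t_max = min(masking_threshold), max(masking_threshold)
    enhanced = []
    for y, d, t in zip(noisy_power, noise_power, masking_threshold):
        alpha = adapt_parameter(t, t_min, t_max, *alpha_range)
        beta = adapt_parameter(t, t_min, t_max, *beta_range)
        # Subtract the scaled noise estimate, but never fall below the floor.
        enhanced.append(max(y - alpha * d, beta * d))
    return enhanced


if __name__ == "__main__":
    # Two bins with equal noisy and noise power but different thresholds.
    print(enhance_frame([10.0, 10.0], [1.0, 1.0], [0.0, 5.0]))
```

In this sketch, bins where the threshold is low receive stronger subtraction, while bins where the noise is already masked are subtracted less aggressively to limit speech distortion.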

This paper describes a predictor for binaural speech intelligibility that computes speech reception thresholds (SRTs) without the need to perform subjective listening tests. Although listening tests are considered to be the most reliable indicators of performance, such tests are time-consuming and costly. The proposed model computes SRTs in two stages. First, it calculates the binaural advantage. Then, it derives the SRTs based on the computed mutual information of the speech and mixture envelopes. Listening tests were conducted with 13 normal-hearing listeners in 15 spatial configurations, covering one, two, and three babble interferers. The proposed predictor performs as well as the baseline model in predicting the intelligibility of binaural vowel-consonant-vowel signals contaminated by multiple nonstationary babble noise sources. The model is evaluated in anechoic conditions and compared with subjective data as well as with the predictions obtained from a baseline binaural speech intelligibility model.

Personalized Object-Based Audio for Hearing Impaired TV Viewers

Authors: Shirley, Ben Guy; Meadows, Melissa; Malak, Fadi; Woodcock, James Stephen; Tidball, Ash


Age demographics have led to an increase in the proportion of the population suffering from some form of hearing loss. The introduction of object-based audio to television broadcasting has the potential to improve the viewing experience for millions of hearing impaired people. Personalization of object-based audio can assist in overcoming difficulties in understanding speech and the narrative audio. The research presented describes a Multi-Dimensional Audio (MDA) implementation of object-based clean audio that presents independent object streams based on object-category elicitation. Evaluations were carried out with hearing impaired people, and participants were able to personalize audio levels independently for four object categories using an on-screen menu: speech, music, background effects, and foreground effects related to on-screen events. Results show considerable preference variation across subjects; nevertheless, expanding object-category personalization beyond a binary speech/nonspeech categorization can substantially improve the viewing experience for some hearing impaired people.

Systems that recognize the emotional content of music and systems that provide music recommendations often use a simplified 4-quadrant model with categories such as Happy, Sad, Angry, and Calm. Previous research has shown that both listeners and automated systems often have difficulty distinguishing low-arousal categories such as Calm and Sad. This paper explores what makes these categories difficult to distinguish. 300 low-arousal excerpts from the classical piano repertoire were used to determine the coverage of the categories Calm and Sad in the low-arousal space, their overlap, and their balance relative to one another. Results show that Calm was 40% bigger in terms of coverage than Sad, but on average, Sad excerpts were significantly more negative in mood than Calm excerpts were positive. Calm and Sad overlapped in nearly 20% of the excerpts, meaning 20% of the excerpts were about equally Calm and Sad. Calm and Sad covered about 92% of the low-arousal space. The largest holes were for excerpts considered Mysterious and Doubtful. Due to the holes in the coverage, the overlaps, and the imbalances, the Calm-Sad model adds about 6% more errors when compared to asking users directly whether the mood of the music is positive or negative. Nevertheless, the Calm-Sad model is still useful and appropriate for many applications.

Engineering reports

With the expanding market for low-cost microphones, raising the manufacturing yields (lowering the rejection rate) becomes a central technical issue. This report explores a case study of possible causes of electret microphone rejection. Because the polarization voltage developed on the microphone diaphragm has a direct effect on the microphone sensitivity, variations in this voltage were investigated. Initial studies were conducted with an arrangement in which a fixture plate (bearing holes for holding microphones) was placed on two base (support) plates. The investigations considered 23 microphone positions across 11 readings. The acceptable and unacceptable polarization voltages were designated, and the corresponding failure percentage was determined. Furthermore, one base plate was removed to increase the distance between the top electrode and the diaphragm, and the polarization voltage was measured. The results showed an appreciable reduction in the polarization voltage that indicated a promising reduction in the microphone failure rate. For a representative electret condenser microphone, a nonlinear variation between sensitivity and polarization voltage was established. Statistical analysis revealed that the measurement data are symmetric and normally distributed. The proposed modification, when implemented on the shop floor, reduced rejection from 33% to 16%.


[Feature] “Broadcasting” is now done as much on the internet as it is over the airwaves. We summarize two workshops presented at the 141st Convention on the latest developments in this field, including delivery of immersive audio to the home and audio for Over-the-top (OTT) TV.

2017 Audio Forensics Conference Preview, Arlington

2017 Semantic Audio Conference Preview, Erlangen


AES Conventions and Conferences


Table of Contents

Cover & Sustaining Members List

AES Officers, Committees, Offices & Journal Staff
