Journal of the Audio Engineering Society

2017 January/February - Volume 65 Number 1/2


An adaptive multichannel Wiener filter (MWF) can be used for joint dereverberation and noise reduction in hearing aids. Using the short-time objective intelligibility (STOI) measure, the authors compare bilateral and binaural configurations of the MWF for several cases: (a) different directions of arrival (DoA) of the target speech, (b) different errors in the assumed DoA, and (c) different levels of microphone self-noise. Although much less robust against DoA errors, the binaural MWFs outperformed the bilateral MWF when the correct DoA was assumed. Furthermore, the bilateral MWF was shown to be affected by the microphone self-noise more than the binaural MWFs. A listening test indicated that a well-steered binaural MWF is able to improve the speech intelligibility in a noisy and reverberant speech scenario, and that this improvement is greater than that of the bilateral MWF. This was true despite the fact that the binaural MWF distorted the binaural cues such that no binaural advantage could be obtained. The post-filters of the bilateral and the binaural MWFs significantly improved the measured speech intelligibility because of the particular maximum likelihood spectral estimator that was used to compute the spectral gain of the filters.

A General Framework for Incorporating Time–Frequency Domain Sparsity in Multichannel Speech Dereverberation

Authors: Jukic, Ante; van Waterschoot, Toon; Gerkmann, Timo; Doclo, Simon

Effective speech dereverberation is a prerequisite in such applications as hands-free telephony, voice-based human-machine interfaces, and hearing aids. Blind multichannel speech dereverberation methods based on multichannel linear prediction (MCLP) can estimate the dereverberated speech component without any knowledge of the room acoustics. This can be achieved by estimating and subtracting the undesired reverberant component from the reference microphone signal. This report presents a general framework that exploits sparsity in the time–frequency domain for MCLP-based speech dereverberation. The framework combines a wideband or a narrowband signal model with either an analysis or a synthesis sparsity prior, and generalizes state-of-the-art MCLP-based speech dereverberation methods.
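The estimate-and-subtract step described above can be sketched for a single STFT frequency bin as follows. This is only a minimal illustration under stated assumptions: it uses a plain least-squares fit rather than any of the sparsity priors that the framework is about, and the function name `mclp_least_squares` and the parameters `delay`, `order`, and `ref` are hypothetical, not taken from the paper.

```python
import numpy as np

def mclp_least_squares(X, delay=2, order=5, ref=0):
    """Delayed linear prediction for one STFT frequency bin (illustrative sketch).

    X     : complex array of shape (frames, mics), one frequency bin over time
    delay : prediction delay in frames, so that early speech is not predicted
    order : number of past frames per microphone used for prediction
    ref   : index of the reference microphone
    """
    n_frames, n_mics = X.shape
    if n_frames <= delay + order:
        raise ValueError("need more frames than delay + order")
    # Stack `order` delayed copies of all microphone signals into one regressor matrix.
    A = np.zeros((n_frames, order * n_mics), dtype=complex)
    for k in range(order):
        shift = delay + k
        A[shift:, k * n_mics:(k + 1) * n_mics] = X[:n_frames - shift]
    # Least-squares prediction filter (assumption: no sparsity prior is applied here).
    g, *_ = np.linalg.lstsq(A, X[:, ref], rcond=None)
    # Subtract the predicted reverberant component from the reference microphone signal.
    return X[:, ref] - A @ g
```

Applied independently to every frequency bin, a function of this kind would correspond to a narrowband signal model; replacing the least-squares cost with a sparsity-promoting one is where the approaches covered by the framework would differ.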

The reproduction of speech over loudspeakers in a reverberant environment is often encountered in daily life, for example, in a train station or during a telephone conference. Spatial reverberation degrades intelligibility. This study proposes two perceptually motivated preprocessing approaches that are applied to the dry speech before it is played into a reverberant environment. In the first algorithm, which assumes prior knowledge of the room impulse response, the amount of overlap-masking due to successive phonemes is reduced. In the second algorithm, onset emphasis is combined with the reduction of overlap-masking. A speech intelligibility model is used to find the best parameters for these algorithms by minimizing the predicted speech reception thresholds. Listening tests show that this preprocessing can indeed improve speech intelligibility in reverberant environments, with speech reception thresholds improving by up to 6 dB.

Methods are proposed for modifying the reverberation characteristics of sound fields in rooms by employing a loudspeaker with adjustable directivity, realized with a compact spherical loudspeaker array (SLA). These methods are based on minimization and maximization of clarity and direct-to-reverberant sound ratio. Significant modification of reverberation is achieved by these methods, as shown in simulation studies. The system under investigation includes a spherical microphone array and an SLA comprising a multiple-input multiple-output system. The robustness of these methods to system identification errors is also investigated. Finally, reverberation and dereverberation results are validated by a listening experiment.

Digital audio effects, such as adding artificial reverberation, are transformations on an audio signal, where the transformation depends on a set of control parameters. Users change parameters over time based on the resulting perceived sound. This research simulates the process of automating the parameters using supervised learning to train classifiers so that they automatically assign effect parameter sets to audio features. Training can be done a priori, for example, by an expert user of the reverberation effects, or online by the user of such an effect. An automatic reverberator trained on a set of audio is expected to be able to apply reverberation correctly on similar audio defined by such properties as timbre, tempo, etc. For this reason, in order to create a reverberation effect that is as general as possible, training requires a large and diverse set of audio data. In one investigation, the user provides monophonic examples of desired reverberation characteristics for individual tracks taken from the Open Multitrack Testbed. This data was used to train a set of models that will automatically apply reverberation to similar tracks. The model was evaluated using classifier F1 scores, mean squared errors, and multistimulus listening tests. The best features from a 31-dimensional feature space were used.

Object-Based Reverberation for Spatial Audio

Authors: Coleman, Philip; Franck, Andreas; Jackson, Philip J. B.; Hughes, Richard J.; Remaggi, Luca; Melchior, Frank


To enable future audio systems to be more immersive, interactive, and easily accessible, object-based frameworks are currently being explored as a means to that end. In object-based audio, a scene is composed of a number of objects, each comprising audio content and metadata. The metadata is interpreted by a renderer, which creates the audio to be sent to each loudspeaker with knowledge of the specific target reproduction system. While recent standardization activities provide recommendations for object formats, the method for capturing and reproducing reverberation is still open. This research presents a parametric approach for capturing, representing, editing, and rendering reverberation over a 3D spatial audio system. A Reverberant Spatial Audio Object (RSAO) allows for an object to synthesize the required reverberation. An example illustrates an RSAO framework with listening tests that show how the approach correctly retains the room size and source distance. An agnostic rendering can be used to alter listener envelopment. Editing the parameters can also be used to alter the perceived room size and source distance; greater envelopment can be achieved with the appropriate reproduction system.

Time domain wave-based methods sidestep the consequences of the simplifying hypotheses that are part of geometric ray-based methods when simulating, modeling, and analyzing room acoustics. This paper illustrates construction techniques for wave-based simulation methods when applied to nontrivial problems in room acoustics, including irregular geometries and frequency-dependent boundary conditions, which are extended to include viscothermal loss effects in air. However, algorithm design has many challenges when applied to realistic room configurations. The main design criteria are arbitrary room geometry, general passive frequency-dependent and spatially-varying wall conditions, and adequate modeling of viscothermal and relaxation effects. The main difficulty is the construction of simulation methods that are numerically stable. Finite volume time domain (FVTD) methods generalize certain finite difference time domain (FDTD) methods, while allowing for stability analysis. FVTD methods with frequency-dependent impedance boundary conditions are extended to handle such viscothermal loss effects in air. An energy-based analysis of numerical stability is presented in detail, illustrating conditionally and unconditionally stable forms that are extended to cover the case of dissipation through a time-integrated energy balance.

Confidence Measures for Nonintrusive Estimation of Speech Clarity Index

Authors: Parada, Pablo Peso; Sharma, Dushyant; van Waterschoot, Toon; Naylor, Patrick A.

In many situations, measuring the amount and type of reverberation in a room assumes that the room impulse response is available for the computation. When that impulse response is not available, a nonintrusive room acoustic (NIRA) method must be used. In this report, the authors use the C50 clarity index to characterize reverberation in the signal because it has been shown to be more highly correlated with speech recognition performance than other measures of reverberation. Multiple features are extracted from a reverberant speech signal and are then used to train a bidirectional long short-term memory model that maps from the feature space into the target C50 value. Prediction intervals, which provide an upper and lower bound of the estimate, can be derived from the standard deviation of the per-frame estimations. Confidence measures are then obtained by normalizing these prediction intervals. These measures are highly correlated with the absolute C50 estimation errors. The performance of the prediction intervals and confidence measures is shown to be consistent in many different noisy reverberant environments. The procedure proposed in this paper for deriving C50 prediction intervals and confidence measures could also be applied to the estimation of other room acoustic parameters, for example, T60 (reverberation decay time to 60 dB) or DRR (direct-to-reverberant ratio).
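For orientation, the sketch below computes C50 directly from a known room impulse response, that is, the intrusive reference value that a nonintrusive estimator tries to predict. It follows the usual definition of C50 as the ratio, in dB, of the energy arriving within 50 ms of the direct sound to the energy arriving afterwards. The function names, the rule of taking the strongest tap as the direct sound, and the "mean plus or minus k standard deviations" interval are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def clarity_index_c50(rir, fs, early_ms=50.0):
    """C50 in dB from a room impulse response `rir` sampled at `fs` Hz."""
    rir = np.asarray(rir, dtype=float)
    # Assumption: the strongest tap marks the arrival of the direct sound.
    onset = int(np.argmax(np.abs(rir)))
    n_early = int(round(early_ms * 1e-3 * fs))
    energy = rir[onset:] ** 2
    early = energy[:n_early].sum()   # energy in the first 50 ms after the direct sound
    late = energy[n_early:].sum()    # all remaining energy
    if late <= 0.0:
        raise ValueError("impulse response has no energy after the early window")
    return 10.0 * np.log10(early / late)

def prediction_interval(per_frame_c50, k=2.0):
    """Illustrative interval: mean of per-frame estimates +/- k standard deviations."""
    est = np.asarray(per_frame_c50, dtype=float)
    centre, spread = est.mean(), est.std()
    return centre - k * spread, centre + k * spread
```

For an idealized exponentially decaying response, the value returned by `clarity_index_c50` depends only on the decay rate, which is one way to sanity-check such a function before applying it to measured responses.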

User Preference on Artificial Reverberation and Delay Time Parameters

Authors: Pestana, Pedro Duarte; Reiss, Joshua D.; Barbosa, Álvaro


This research explores the common belief that the settings of artificial reverb and delay time in music production are strongly linked to musical tempo and related factors. Two subjective tests were used to evaluate the preferences of young adults with formal training in audio engineering regarding artificial reverberation and delay time. Results show there is a clear relationship between musical tempo and delay time preference, but reverberation time preferences cannot be explained in this way. Post-test interviews with subjects helped explain that the setting of reverberation time is seen as too multidimensional to be correlated to a single factor (namely song tempo) because stylistic attributes and song genre have a bearing on the user’s choice. However, results still seem to indicate that even if song tempo is not the main correlate, there may be other low-level factors that strongly explain this variable.

Perceptual Evaluation and Analysis of Reverberation in Multitrack Music Production

Authors: De Man, Brecht; McNally, Kirk; Reiss, Joshua D.


Despite the prominence of artificial reverberation in music production, there are few studies that explore the conventional usage and the resulting perception in the context of a mixing studio. Research into the use of artificial reverberation is difficult because of the lack of standardized parameters, inconsistent interfaces, and a diverse group of algorithms. In multistimuli listening tests, trained engineers were asked to rate 80 mixes that were generated from 10 professional-grade music recordings. Annotated subjective comments were also analyzed to determine the importance of reverberation in the perception of mixes, as well as classifying mixes as having too much or too little overall reverberation. The results support the notion that a universally preferred amount of reverberation is unlikely to exist, but the upper and lower bounds can be identified. The importance of careful parameter adjustment is evident from the limited range of acceptable feature values with regard to the perceived amount of reverberation relative to the just-noticeable differences in both reverberation loudness and early decay time. This study confirms previous findings that a perceived excess of reverberation typically has a detrimental effect on subjective preference. The ability to predict the desired amount of reverberation with a reasonable degree of accuracy has applications in automatic mixing and intelligent audio effects.

Speech signals recorded in an enclosed space by microphones at a distance from the speaker are often corrupted by reverberation, which arises from the superposition of many delayed and attenuated copies of the source signal. Because reverberation degrades the signal, removing reverberation would enhance quality. Dereverberation techniques based on acoustic multichannel equalization are known to be sensitive to room impulse response perturbations. In order to increase robustness, several methods have been proposed, for example, using a shorter reshaping filter length, incorporating regularization, or applying a sparsity-promoting penalty function. This paper focuses on evaluating the performance of these methods for single-source multi-microphone scenarios, using instrumental performance measures as well as subjective listening tests. By analyzing the correlation between the instrumental and the perceptual results, it is shown that signal-based performance measures are more advantageous than channel-based performance measures to evaluate the perceptual speech quality of signals that were dereverberated by equalization techniques. Furthermore, this analysis also demonstrates the need to develop more reliable instrumental performance measures.

Engineering reports

A Rapid Sensory Analysis Method for Perceptual Assessment of Automotive Audio

Authors: Kaplanis, Neofytos; Bech, Søren; Tervo, Sakari; Pätynen, Jukka; Lokki, Tapio; van Waterschoot, Toon; Jensen, Søren Holdt

As today’s automotive audio systems rapidly evolve, it is unclear if the current perceptual assessment protocols fully capture the human sensations evoked by such new systems. The highly complex and acoustically hostile environment of the automobile cabin hinders the effectiveness of standard objective metrics, which lack robustness, repeatability, and perceptual relevance. This report examines the current assessment protocols and their identified limitations. A new design of an assessment protocol is proposed. It uses the Spatial Decomposition Method for acquiring, analyzing, and reproducing the sound field in a laboratory over loudspeakers, thereby allowing instant comparisons of automotive audio systems. A rapid sensory analysis protocol, the Flash Profile, is employed for evaluating the perceptual experience using individually elicited attributes in a time-efficient manner. A pilot experiment is described in which expert, experienced, and naive assessors followed the procedure and evaluated three sound fields. Current findings suggest that this method allows for the assessment of both spatial and timbral properties of automotive sound.


Rolling out AES67

Authors: Rumsey, Francis

[Feature] AES67 has reached the point where it has been accepted very widely by the industry. It provides a mode of operation that allows one system to communicate audio information with another, while still allowing for enhanced features to be added on top. While certain features such as device/stream discovery were intentionally omitted from the standard, open solutions are emerging that will deal with this.

2017 International Conference on Automotive Audio, Call for Contributions, San Francisco

143rd Convention, Call for Contributions, New York


Annual Report

AES Conventions and Conferences


Table of Contents

Cover & Sustaining Members List

AES Officers, Committees, Offices & Journal Staff
