Journal of the Audio Engineering Society

2013 July/August - Volume 61 Number 7/8


Acoustic Zooming by Multi-Microphone Sound Scene Manipulation

Authors: van Waterschoot, Toon; Tirry, Wouter Joos; Moonen, Marc

Camera zooming would be more compelling if the audio were subjected to a corresponding zoom that matched the video. Psychophysical and neuroimaging results suggest that a cross-modal approach to zooming facilitates multisensory integration. Because auditory distance perception is primarily determined by sound intensity, an audiovisual zoom effect can be obtained by matching the levels of different sources in a sound scene with their visually perceived distance. The authors propose a general theory for independent sound source level control that can be used to attain an acoustic zoom effect. The theory does not require sound source separation, which reduces computational load. An efficient implementation using fixed and adaptive spatial and spectral noise-reduction algorithms is proposed and evaluated. Experimental results using an array of a small number of low-cost microphones confirm that the proposed approach is particularly suited for consumer audiovisual capture applications.
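The level-versus-distance relationship behind the zoom effect can be illustrated with a minimal sketch. It models zooming in as advancing a virtual listener toward the scene and applies the inverse-distance law (about -6 dB per doubling of distance) to derive a per-source gain; the function name, the single `camera_advance` parameter, and the clamping constant are illustrative assumptions, not the authors' level-control theory.

```python
import numpy as np

def zoom_gains(distances, camera_advance):
    """Per-source gains for a simple acoustic zoom (illustrative sketch).

    distances:      source-to-listener distances in meters
    camera_advance: how far the virtual listener moves toward the scene

    Under the inverse-distance law, a source heard from distance d_new
    instead of d gets louder by the ratio d / d_new, so nearby sources
    are boosted more than distant ones, mimicking a visual zoom.
    """
    d = np.asarray(distances, dtype=float)
    d_new = np.maximum(d - camera_advance, 1e-3)  # clamp to avoid blow-up
    return d / d_new
```

For example, advancing 0.5 m toward sources at 1 m, 2 m, and 8 m boosts the closest source by a factor of 2 while the farthest changes only slightly, which is what differentiates a zoom from a plain volume increase.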


Reverberation is a problem for source separation algorithms. Because the precedence effect allows human listeners to suppress the perception of reflections arising from room boundaries, numerous computational models have incorporated the precedence effect. However, relatively little work has been done on using the precedence effect in source separation algorithms. This paper compares several precedence models and their influence on the performance of a baseline separation algorithm. The models were tested in a variety of reverberant rooms and with a range of mixing parameters. Performance differed widely among the models; the one based on interaural coherence and onset-based inhibition produced the greatest improvement. There is a trade-off between selecting reliable cues that correspond closely to free-field conditions and maximizing the proportion of the input signals that contributes to localization. For optimal source separation performance, it is necessary to adapt the dynamic component of the precedence model to the acoustic conditions of the room.
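The interaural-coherence cue used by the best-performing model can be sketched as follows. One common definition takes the maximum of the normalized cross-correlation between the two ear signals within a short frame; this is a generic illustration, not the exact model evaluated in the paper.

```python
import numpy as np

def interaural_coherence(left, right):
    """Interaural coherence of one short frame of binaural signals.

    Computed as the maximum absolute value of the cross-correlation
    between the ear signals, normalized by the signal energies.
    Values near 1 indicate a dominant direct sound (a reliable
    localization cue); diffuse reverberation pushes the value down.
    """
    xcorr = np.correlate(left, right, mode="full")
    norm = np.sqrt(np.sum(left ** 2) * np.sum(right ** 2))
    return np.max(np.abs(xcorr)) / (norm + 1e-12)
```

Frames whose coherence exceeds a threshold would be trusted for localization; the trade-off mentioned above corresponds to how strictly that threshold is set.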

Real-Time Speech Signal Segmentation Methods

Authors: Kupryjanow, Adam; Czyzewski, Andrzej

Many researchers have developed algorithms for speech-signal segmentation in such applications as automatic speech recognition, speech coding, echo cancellation, automatic noise reduction, signal-to-noise ratio estimation, nonuniform time-scale modification, estimating the speech rate, automatic language identification, and speaker emotion recognition. Unlike algorithms that work offline, the authors developed two algorithms for real-time speech analysis as applied to detection of vowel regions and estimation of speech rate. The accuracy, reliability, and real-time performance of these algorithms were evaluated on samples of Polish speech; experimental results showed that they performed as well as or better than existing offline approaches.
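To give a feel for the vowel-region detection task, here is a deliberately simple frame-energy detector. It exploits the fact that vowels carry most of the energy in speech; the frame length and threshold are arbitrary assumptions, and the authors' real-time algorithms are considerably more elaborate.

```python
import numpy as np

def vowel_regions(signal, frame_len=256, threshold_db=-20.0):
    """Toy energy-based vowel-region detector (illustrative only).

    Splits the signal into fixed-length frames and flags frames whose
    energy is within `threshold_db` of the loudest frame, on the
    premise that high-energy frames mostly contain vowels.
    Returns a boolean mask with one value per frame.
    """
    n_frames = len(signal) // frame_len
    frames = np.reshape(signal[:n_frames * frame_len], (n_frames, frame_len))
    energy = np.sum(frames ** 2, axis=1)
    level_db = 10.0 * np.log10(energy / np.max(energy) + 1e-12)
    return level_db > threshold_db
```

A practical real-time detector would additionally use spectral cues and adaptive thresholds to stay robust against noise and loud consonants.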

Perceptual Objective Quality Evaluation Method for High Quality Multichannel Audio Codecs

Authors: Seo, Jeong-Hun; Chon, Sang Bae; Sung, Keong-Mo; Choi, Inyong

In order to avoid the high cost of subjective listening tests for evaluating sound quality, objective assessment methods based on psychoacoustics have been routinely used. This research explores an assessment method for evaluating high-quality, multichannel audio codecs with a model that incorporates five monaural Model Output Variables (MOVs) combined with four novel MOVs for predicting degradation of spatial attributes. When trained and verified with a listening-test database of high-quality audio codecs, the model was able to predict small perceptual differences between test and reference signals in both spatial and timbral quality.

Headphone-Based Virtual Spatialization of Sound with a GPU Accelerator

Authors: Belloch, Jose A.; Ferrer, Miguel; Gonzalez, Alberto; Martinez-Zaldivar, F.J.; Vidal, Antonio M.

This paper describes the design of a binaural headphone-based multisource spatial-audio application using a Graphics Processing Unit (GPU) as the compute engine. The GPU is a highly parallel programmable coprocessor that provides massive computation power when the algorithm is properly parallelized. To render a sound source at a specific location, audio samples must be convolved with Head-Related Impulse Response (HRIR) filters for that location. A database of HRIRs at fixed spatial positions is used. Solutions have been developed to handle two problems: synthesizing sound-source positions that are not in the HRIR database, and virtualizing the movement of the sound sources between different positions. The GPU is particularly appropriate for simultaneously executing multiple convolutions without overloading the main CPU. The results show that the proposed application is able to handle up to 240 sources simultaneously when all sources are moving.
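The core rendering operation, convolving each mono source with the left- and right-ear HRIRs for its position and summing into a stereo output, can be sketched on the CPU as follows. This is a plain NumPy illustration of the signal flow, not the paper's GPU implementation, and the function name and argument layout are assumptions.

```python
import numpy as np

def binaural_render(sources, hrirs):
    """Mix mono sources into a binaural stereo signal (illustrative sketch).

    sources: list of 1-D mono signals
    hrirs:   list of (hrir_left, hrir_right) pairs, one per source,
             taken from an HRIR database for each source's position

    Each source is convolved with both ear filters; the per-ear results
    are summed, yielding a (2, N) array of left/right samples.
    """
    n = max(len(s) + max(len(hl), len(hr)) - 1
            for s, (hl, hr) in zip(sources, hrirs))
    out = np.zeros((2, n))
    for s, (hl, hr) in zip(sources, hrirs):
        yl = np.convolve(s, hl)
        yr = np.convolve(s, hr)
        out[0, :len(yl)] += yl
        out[1, :len(yr)] += yr
    return out
```

Because every source-ear convolution is independent, the loop body maps naturally onto parallel GPU kernels, which is what makes hundreds of simultaneous moving sources feasible.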

Audio Pitch Shifting Using the Constant-Q Transform

Authors: Schörkhuber, Christian; Klapuri, Anssi; Sontacchi, Alois

Pitch shifting of polyphonic music is usually performed by manipulating the time–frequency representation of the input signal such that frequency is scaled by a constant and time duration remains unchanged. A method for pitch shifting is proposed that exploits the logarithmic frequency-bin spacing of the Constant-Q Transform (CQT). Pitch-scaling of monophonic and dense polyphonic music signals is achieved by a simple linear translation of the CQT representation followed by a phase update stage. This approach provides a natural solution to the problem of transients because the CQT has good time resolution at high frequencies, while interference between tonal components at low frequencies is reduced. Performing pitch shifting directly in the frequency domain allows the algorithm to process only parts of the signal while leaving other parts unchanged. Audio examples demonstrate the quality of the proposed algorithm for scaling factors up to an octave.
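Why a linear translation suffices can be shown with a short calculation. Because CQT bins are logarithmically spaced, scaling every frequency by a constant factor shifts every component by the same number of bins, independent of its absolute frequency. The sketch below computes that bin offset; the 48 bins-per-octave resolution is an assumed value, not taken from the paper.

```python
import math

def cqt_bin_shift(scale_factor, bins_per_octave=48):
    """Number of CQT bins to translate for a given pitch-scaling factor.

    With B bins per octave, a frequency scaling by `scale_factor`
    corresponds to a shift of log2(scale_factor) octaves, i.e.
    B * log2(scale_factor) bins, the same for every bin. Rounding to
    an integer is what makes the shift a simple array translation
    (the remaining fractional part is handled by the phase update).
    """
    return round(bins_per_octave * math.log2(scale_factor))
```

For instance, at 48 bins per octave a one-semitone shift (factor 2^(1/12)) is exactly 4 bins, and a full octave (factor 2) is 48 bins.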

Deriving the transfer function of an electroacoustical system in a normal reverberant environment is usually based on frequency-domain methods that compute the anechoic spectra of the direct and early sounds. In contrast to time-delay spectrometry and time-gated frequency analysis, this paper proposes a Wiener-Hopf solution to time-gating for anechoic transfer-function measurements. The procedure eliminates scattered or reflected rays. Compared to methods based on the discrete Fourier transform, this approach allows for nonuniform frequency-domain sampling when the system's magnitude and/or phase responses exhibit large variations. Since the approach relies only on frequency-domain measurements and simple time-gating, it does not require complicated and expensive equipment such as pulse generators and fast sampling oscilloscopes.
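For context, the classical DFT-based time-gating that the paper improves on works as follows: transform the measured frequency response to an impulse response, zero everything after the direct sound (so room reflections are discarded), and transform back. This sketch shows that baseline, not the paper's Wiener-Hopf formulation, and it requires uniform frequency sampling, which is exactly the restriction the proposed method lifts.

```python
import numpy as np

def time_gate_response(H, gate_samples):
    """Classical DFT-based time-gating of a measured frequency response.

    H:            uniformly sampled complex frequency response
    gate_samples: number of initial impulse-response samples to keep,
                  chosen so reflections arrive after the gate closes

    Returns the gated, quasi-anechoic frequency response.
    """
    h = np.fft.ifft(H)       # back to the time domain
    h[gate_samples:] = 0.0   # discard reflected / scattered energy
    return np.fft.fft(h)     # gated response
```

If a reflection-free impulse response is a unit impulse and a reflection arrives 10 samples later, gating after 5 samples recovers the anechoic (flat) response exactly.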

Standards and Information Documents

AES Standards Committee News


[Feature] Game audio has reached a state of maturity that implies parity of status with graphics. The most recent research in the field is concerned with connecting sound design to other game creation processes in a more integrated fashion and with devising adaptive sound design tools to increase emotional involvement and interactivity.

134th Convention Report, Rome

135th Convention Preview, New York

135th Exhibitor Previews

53rd Call for Papers, London

54th Call for Papers, London


Advertiser Internet Directory

AES Conventions and Conferences


Table of Contents

Cover & Sustaining Members List

AES Officers, Committees, Offices & Journal Staff
