Journal of the Audio Engineering Society

2024 July/August - Volume 72 Number 7/8


This paper presents results of a listening experiment evaluating three-degrees-of-freedom binaural reproduction of head-mounted microphone array signals. The methods are applied to an array of five microphones whose signals were simulated for static and dynamic array orientations. Methods under test involve scene-agnostic binaural reproduction methods as well as methods that have knowledge of (a subset of) source directions. The results of an instrumental evaluation reveal errors in the reproduction of interaural level and time differences for all scene-agnostic methods, which are smallest for the end-to-end magnitude-least-squares method. Additionally, the inherent localization robustness of the array under test and different simulated microphone arrays is investigated and discussed, which is of interest for a parametric reproduction method included in the study. In the listening experiment, the end-to-end magnitude-least-squares reproduction method outperforms other scene-agnostic approaches. Notably, linearly constrained beamformers using known source directions in combination with the end-to-end magnitude-least-squares method outcompete the scene-agnostic methods in perceived quality, especially for a rotating microphone array under anechoic conditions.


The spatial sampling of binaural room transfer functions that vary with listener movements, as required for rendering personal sound zones (PSZ) with head tracking, was experimentally investigated regarding its dependencies on various factors. Through measurements of the binaural room transfer functions in a practical PSZ system with either translational or rotational movements of one of the two mannequin listeners, the PSZ filters were generated along the measurement grid and then spatially downsampled to different resolutions, at which the isolation performance of the system was numerically simulated. It was found that the spatial sampling resolution generally depends on factors such as the moving listener’s position, the frequency band of the rendered audio, and perturbation caused by the other listener. More specifically, the required sampling resolution is inversely proportional to the distance either between the two listeners or between the moving listener and the loudspeakers and is proportional to the frequency of the rendered audio. The perturbation caused by the other listener may impair both the isolation performance and filter robustness against movements. Furthermore, two crossover frequencies were found to exist in the system, which divide the frequency band into three sub-bands, each with a distinctive requirement for spatial sampling.

Singer and Audience Evaluations of a Networked Immersive Audio Concert

Authors: Cairns, Patrick; Rudzki, Tomasz; Cooper, Jacob; Hunt, Anthony; Steele, Kim; Acosta Martínez, Gerardo; Chadwick, Andrew; Daffern, Helena; Kearney, Gavin

At the 2023 AES International Conference on Spatial and Immersive Audio, a networked immersive audio concert was performed. A vocal octet connected over the Internet between York and Huddersfield and provided a performance that was auralized in the acoustics of BBC Maida Vale Studio 2. A live audience in Huddersfield experienced the concert with local singers on stage, remote singers auralized alongside, and virtual acoustics rendered on a multichannel array. Another audience in York listened to the concert on headphones. An evaluation of the networked concert experience of the performers and audience is presented in this paper. Results demonstrate that a generally high-quality experience was delivered. Audience response to immersive audio rating items demonstrates a variance in experience. Several aspects of the evaluation context are identified as relevant to this rating variance and discussed as open challenges for audio engineers.

This study investigates different interpolation techniques for spatially upsampling Binaural Room Impulse Responses (BRIRs) measured on a sparse grid of view orientations. In this context, the authors recently presented the Spherical Array Interpolation by Time Alignment (SARITA) method for interpolating spherical microphone array signals with a limited number of microphones, which is adapted for the spatial upsampling of sparse BRIR datasets in the present work. SARITA is compared with two existing nonparametric BRIR-interpolation methods and naive linear interpolation. The study provides a technical and perceptual analysis of the interpolation performance. The results show the suitability of all interpolation methods apart from linear interpolation for achieving a realistic auralization, even for very sparse BRIR sets. For angular resolutions of 30° and real-world stimuli, most participants could not distinguish SARITA from an artifact-free reference.
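The core idea behind time-aligned interpolation, as opposed to the naive linear interpolation the abstract above mentions, can be illustrated with a minimal sketch. This is a generic, hypothetical illustration of the time-alignment principle only, not the SARITA algorithm itself: the relative delay between two impulse responses is removed before blending, then a blended delay is restored, avoiding the comb-filter artifacts that direct sample-wise averaging produces.

```python
import numpy as np

def time_aligned_interp(h1, h2, w=0.5):
    """Blend two impulse responses h1 and h2 with weight w on h1.

    Naive linear interpolation (w*h1 + (1-w)*h2) of responses with
    different times of arrival creates two smeared peaks (comb
    filtering). Here the relative delay is estimated, removed before
    blending, and a weighted delay is reapplied afterward.
    """
    # Estimate the relative delay from the cross-correlation peak
    xcorr = np.correlate(h2, h1, mode="full")
    lag = int(np.argmax(xcorr)) - (len(h1) - 1)

    # Align h2 to h1 by removing the relative delay
    h2_aligned = np.roll(h2, -lag)

    # Blend the time-aligned responses
    h = w * h1 + (1.0 - w) * h2_aligned

    # Restore an interpolated share of the delay
    frac_lag = int(round((1.0 - w) * lag))
    return np.roll(h, frac_lag)
```

For two unit impulses at samples 10 and 20, an equal-weight blend yields a single impulse near sample 15 rather than two half-amplitude impulses, which is the qualitative difference between time-aligned and naive linear interpolation.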

Effects of Torso Location and Rotation to HRTF

Authors: Johansson, Jaan; Mäkivirta, Aki; Malinen, Matti

The significance of representing realistic torso orientation relative to the head in the head-related transfer function (HRTF) is studied in this work. The actual head position relative to the torso is found for 195 persons. The effect of the head position on the HRTF is studied by modifying the 3D model of a Kemar head-and-torso simulator geometry, translating the head relative to the torso in the up-down and forward-backward directions and rotating the torso. The spectral difference is compared with that seen in the closest matching actual persons. The forward-backward location of the head has the strongest influence on the HRTF. The spectral difference between the fixed and rotated torso spectra can exceed a 1-dB limit for all sound arrival azimuth directions when the torso rotation exceeds 10°. The spectral difference decreases with increasing source elevation. A subjective listening test with personal HRTFs demonstrates that the spectral effects of torso rotation are audible as sound color and location changes. The HRTF data in this work are obtained by calculating the sound field using the boundary element method and the 3D shape of the person acquired using photogrammetry.

This paper presents the practitioners’ perspective on mixing popular music in spatial audio, particularly Dolby Atmos and the binaural mixes generated by the Dolby and Apple renderers. It presents the results of a dual-stage study, which utilized focus groups with eight professional music producers and a questionnaire completed by 140 practitioners. Analysis revealed the continued influence of stereo approaches on mix engineers, partly due to its historical dominance as a production platform and consumers’ continued use of headphones. It was also found that core elements of popular music productions, such as snare drums, tom-tom drums, kick drums, bass guitars, main guitars, and vocals, were less likely to have binaural processing applied compared with other sources. It was also shown there were perceived differences in the suitability of spatial audio mixing for specific genres, with electronic dance music, jazz, pop, classical, and world music rated as the most suitable. Regarding the binaural renderers, there was less user satisfaction with the Apple device compared with Dolby’s, and this dissatisfaction manifested mainly in the need for more user control. Finally, mix engineers were very aware of the importance of their mixes translating to smaller speaker systems and headphone playback, in particular.

Engineering Reports

Practical Implementation of Automated Next Generation Audio Production for Live Sports

Authors: Moulson, Aimée; Walley, Max; Grewe, Yannik; Oldfield, Rob; Shirley, Ben; Scuda, Ulli


Producing a high-quality audio mix for a live sports production is a demanding task for mixing engineers. The management of many microphone signals and the monitoring of various broadcast feeds mean engineers are often stretched, overseeing many tasks simultaneously. With Next Generation Audio codecs providing many appealing features to end users, such as interactivity and personalization, care is needed so as not to create further work for production staff. Therefore, the authors propose a novel approach to live sports production that combines an object-based audio workflow with the efficiency benefits of automated mixing. This paper describes how a fully object-based workflow can be built from the point of capture to audience playback with minimal changes for the production staff. This was achieved by integrating Next Generation Audio authoring from the point of production, streamlining the workflow and thus removing the need for an additional authoring process later in the chain. As an exemplar, the authors applied this approach to a Premier League football match in a proof-of-concept trial.
