Journal of the Audio Engineering Society

2018 May - Volume 66 Number 5


Reinforcement-Learning-Based Personalization of Head-Related Transfer Functions

Authors: Nambu, Isao; Washizu, Manabu; Morioka, Shuhei; Hasegawa, Yuta; Sakuma, Wataru; Yano, Shohei; Hokari, Haruhide; Wada, Yasuhiro

For a listener to perceive the spatial locations of virtual sounds over stereo headphones, individually measured head-related transfer functions (HRTFs) are required. However, accurate HRTF measurement is usually difficult. Previous studies have proposed methods of HRTF personalization that avoid measurement, but localization errors often remain and further refinement is challenging. This research proposes a method that uses reinforcement learning driven by listener evaluations to obtain an accurate individual HRTF without measurement. The authors conducted a proof-of-concept simulation and an experiment involving human subjects. The simulation confirmed that the proposed method could acquire individual HRTFs close to the measured dummy-head HRTF. In the human experiment, the method was used to learn the HRTF for a single direction without any individual HRTF measurement. The results showed improved horizontal-plane localization for the learned HRTF compared to the dummy-head HRTF. Together, these experiments demonstrate the potential of the proposed reinforcement-learning-based personalization method to give listeners accurate virtual sound environments.

Faster sensory profiling methods are gaining interest because they do not require a training phase and can be performed by trained or untrained assessors. As the third article in a series, this paper considers the check-all-that-apply (CATA) method with high-end loudspeakers. A CATA question is a multiple-choice question in which participants are presented with a list of descriptive attributes (words or phrases) and a product to be evaluated; listeners are asked to select all the attributes that they think apply to the product. A preliminary test was conducted with naïve assessors to reduce a list of possible attributes to a number suitable for a CATA question. A listening test was then conducted with naïve assessors using the CATA method, and two methods for characterizing the loudspeakers were explored: Correspondence Analysis (CA) and Hierarchical Cluster Analysis (HCA). CA produced bidimensional plots for each track, allowing a characterization to be inferred from the positioning of loudspeakers and attributes. HCA provided four clusters of loudspeaker-track combinations, with the attributes that best describe each cluster. The CATA method was able to produce discriminative and descriptive characterizations of the loudspeakers.
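Before CA or HCA can be applied, CATA responses are tallied into a product-by-attribute contingency table of citation counts. A minimal sketch of that tallying step, using hypothetical loudspeakers and attributes rather than the study's data:

```python
from collections import defaultdict

# Hypothetical CATA responses: (loudspeaker, attributes one listener checked).
responses = [
    ("A", {"bright", "clear"}),
    ("A", {"bright", "harsh"}),
    ("A", {"clear"}),
    ("B", {"warm", "boomy"}),
    ("B", {"warm", "dull"}),
    ("B", {"boomy"}),
]

attributes = ["bright", "clear", "harsh", "warm", "boomy", "dull"]

def cata_counts(responses, attributes):
    """Build the product-by-attribute contingency table (citation counts)."""
    table = defaultdict(lambda: {a: 0 for a in attributes})
    for product, checked in responses:
        for a in checked:
            table[product][a] += 1
    return dict(table)

counts = cata_counts(responses, attributes)
```

This count table is exactly the input that Correspondence Analysis decomposes into the bidimensional plots described above, and its row profiles are what a hierarchical clustering would group.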

This paper analyzes the pitch fluctuations of different notes in Taiwanese singing in order to build an F0 note-type-based control model that improves the naturalness of synthesized Taiwanese singing voice by producing more natural F0 contours. The factors that significantly differentiate singing synthesis from speech synthesis must be taken into account when designing a singing synthesizer. Among these, the fundamental frequency (F0) contour is an important feature that deeply affects the perception of a singing voice and needs to be controlled precisely. A real F0 contour contains fluctuations rather than following the stepwise pitch curve derived from the musical notes. These fluctuations are important features in singing-related applications such as singing synthesis, singing voice detection, performance analysis, singing/music recognition, singing-style identification, and query-by-humming. Overshoot percentage and preparation percentage are proposed to quantify the extent of these fluctuations. Statistics for each note category were established from a corpus of Taiwanese nursery rhymes, and different extents of overshoot and preparation for separate categories of notes were modeled for males, females, and children according to these statistics. A PID controller acting on a second-order system is proposed to adjust quickly to the correct F0 level of a note and then remain sufficiently steady at that level to produce a pleasant singing voice.
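The control idea in the final sentence can be sketched as a textbook PID loop around a second-order plant: the controller drives the "voice" from one note's F0 to the next, and the second-order dynamics naturally produce overshoot around the target before settling. The plant parameters and gains below are illustrative assumptions, not the paper's fitted values:

```python
def simulate_f0(target=440.0, y0=330.0, dt=1e-4, steps=5000,
                kp=2.0, ki=50.0, kd=0.01, omega=100.0, zeta=1.0):
    """PID controller driving a second-order 'voice' plant toward a target F0.

    Plant:  y'' + 2*zeta*omega*y' + omega^2 * y = omega^2 * u
    (unity DC gain), integrated with a simple Euler scheme.
    """
    y, v = y0, 0.0                # current F0 (Hz) and its rate of change
    integ = 0.0                   # integral of the error
    e_prev = target - y           # initialized to avoid a derivative kick at t=0
    for _ in range(steps):
        e = target - y
        integ += e * dt
        deriv = (e - e_prev) / dt
        e_prev = e
        u = kp * e + ki * integ + kd * deriv          # PID control signal
        a = omega**2 * (u - y) - 2.0 * zeta * omega * v
        v += a * dt
        y += v * dt
    return y

final_f0 = simulate_f0()   # note transition 330 Hz -> 440 Hz over 0.5 s
```

The integral term removes the steady-state error while the proportional and derivative terms shape the transient, mirroring the abstract's requirement that the controller reach the correct F0 level quickly and then hold it steadily.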

A fundamental aspect of loudspeaker modeling is the ability to calculate the time-domain response of a driver in free air or in a specified enclosure given prior calculation of the frequency response. This report presents a numerical method to compute the time-domain response of a loudspeaker or other transducer using contour integration of its frequency response. The approach is based on Weideman’s scheme for inversion of the Laplace transform along a parabolic contour, and it is applicable to analytic functions that contain isolated singularities (poles) and branch points in the left half-plane. The new approach is motivated by the need to incorporate viscoelastic and semi-inductance effects into linear and nonlinear, time-dependent transducer calculations. Because the response functions that describe these phenomena contain fractional-power and logarithmic singularities, solution methods based on rational-function decomposition cannot be used. The new method is simple to implement, requires few function evaluations, and is remarkably accurate. For analyzing the step response of box models, the proposed Weideman-based contour method can be applied universally with only a few additional lines of code.
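A minimal sketch of the underlying numerical idea: the Bromwich inversion integral is evaluated by the trapezoidal rule along a parabolic contour s(u) = mu*(i*u + 1)^2 in the left half-plane. The step size and contour scale below follow common choices for this contour family and are not taken from the report; the sketch is validated on the elementary pair F(s) = 1/(s+1), f(t) = exp(-t), not on a viscoelastic response function:

```python
import cmath, math

def invert_laplace(F, t, N=40):
    """Numerical inverse Laplace transform on a parabolic Bromwich contour.

    Trapezoidal rule with 2N+1 nodes on s(u) = mu*(i*u + 1)**2.  Taking the
    real part of each term assumes a real-valued time response f(t).
    """
    mu = math.pi * N / (12.0 * t)      # contour scale (grows like N/t)
    h = 3.0 / N                        # node spacing in the parameter u
    total = 0.0
    for k in range(-N, N + 1):
        u = k * h
        s = mu * (1j * u + 1) ** 2
        # ds / (2*pi*i)  =  mu * (1 + i*u) / pi * du
        total += (cmath.exp(s * t) * F(s) * mu * (1 + 1j * u)).real
    return h * total / math.pi

# Validation on a known transform pair: F(s) = 1/(s+1)  <->  f(t) = exp(-t).
approx = invert_laplace(lambda s: 1.0 / (s + 1.0), 1.0)
```

Because exp(s*t) decays rapidly along the left-opening parabola, only a few dozen function evaluations are needed per time point, which is the practical appeal noted in the abstract.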

Acoustic Event Classification using Low-Resolution Multi-label Non-negative Matrix Deconvolution

Authors: Vuegen, Lode; Karsmakers, Peter; Vanrumste, Bart; Hamme, Hugo Van

With the increased proliferation of interconnected devices that have built-in microphones, acoustic event classification and monitoring becomes possible in a wide variety of applications, such as surveillance, healthcare, military, machine diagnostics, and wildlife tracking. The promise and success of these applications depend on robust sensing of acoustic events in the environment. Typically, sound event classes are defined by annotating training data, which is a laborious process. This work introduces an extended version of non-negative matrix deconvolution (NMD), called low-resolution multi-label non-negative matrix deconvolution (LRM-NMD), in which both the observation data and the available labeling information are used during training. Here, low-resolution multi-label information simply indicates which sound classes occur somewhere within a longer stretch of the acoustic data, without marking the beginning or end of the individual events. The proposed extension of NMD was successfully applied to the classification of acoustic events, even in noisy conditions with overlapping events.
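LRM-NMD builds on the non-negative factorization family. As an illustration of that foundation only (plain NMF with Lee-Seung multiplicative updates, not the paper's LRM-NMD and without its convolutive or label-coupling terms), the following factors a toy spectrogram-like matrix into non-negative parts:

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def nmf(V, rank, iters=500, seed=0, eps=1e-9):
    """Factor V ~= W @ H with all entries non-negative, using the
    Lee-Seung multiplicative updates for the Euclidean cost ||V - WH||^2."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return W, H

# Toy time-frequency matrix built from two non-negative spectral patterns.
V = [[1, 0, 2, 0], [0, 3, 0, 3], [2, 0, 4, 0]]
W, H = nmf(V, rank=2)
R = matmul(W, H)
err = sum((V[i][j] - R[i][j]) ** 2 for i in range(3) for j in range(4))
```

In the acoustic-event setting, the columns of W play the role of learned spectral event dictionaries and H their activations over time; LRM-NMD additionally constrains such activations with the weak, untimed labels described above.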


The mixing engineer and the reproduction system both influence how a listener perceives a song. For unfamiliar music, the mix can significantly influence listener preferences. The goal of this research was to investigate the influence of the reproduction system and of mixing parameters by asking listeners for preference ratings in a paired-comparison test. The study generated mixes of four different popular music recordings and introduced systematic changes to the wave field synthesis mix for one song. The mixing parameters were EQ, compression, reverb, and spatial positioning. Listeners rated their preference in comparisons of two-channel stereophony, five-channel stereophony, and wave field synthesis using a circular array of 56 loudspeakers. Even when relatively strong changes were introduced to the wave field synthesis mix, listeners still preferred that system most of the time. This preference depended on the actual content and might vary between different songs, or even song excerpts. The mixing condition listeners disliked the most was a very wide arrangement of the foreground elements of popular music, such as vocals, snare and bass drum, and guitars. Overall, using a high number of loudspeakers is preferred by most listeners, and the differences between reproduction methods can have a larger influence than strong variations of single mixing parameters.

Standards and Information Documents

AES Standards Committee News


[Feature] Virtual reality (VR) is the big theme of the moment in audio, as it is in a number of other domains of media engineering. It’s joined by its partner, augmented reality (AR), which attempts to integrate elements of artificial environments with one’s experience of the real world. At the AES 143rd Convention sessions in the Game Audio and VR track, presenters looked at the challenges and opportunities, as well as the workflows needed for success in this fast-evolving field.


AES Conventions and Conferences


Table of Contents

Cover & Sustaining Members List

AES Officers, Committees, Offices & Journal Staff


