Journal of the Audio Engineering Society

2019 April - Volume 67 Number 4


This paper proposes a novel approach for rhythmic analysis of recorded percussion music based on information theory. Given an audio recording of a percussion music performance, the algorithm computes a lossy representation that captures much of its underlying regularity but tolerates some amount of distortion. Within a rate-distortion theory framework, the trade-off between rate and distortion allows for the extraction of some relevant information about the performance. Downbeat detection is addressed using lossy coding of an accentuation feature under rate-distortion criteria assuming the correct alignment produces the simplest explanation for the data. Experiments were conducted in order to assess the usefulness of the proposed approach when applied to a dataset of candombe drumming audio recordings. In particular, different performances were compared according to a measure of their overall complexity drawn from the operational rate-distortion curve, yielding results that roughly correspond to subjective judgment and correlate well with personal style and expertise.

Automated speaker recognition attains impressive reliability when tested under controlled laboratory acoustic conditions. However, the environmental noise that inevitably exists in many real-world speech samples causes considerable degradation of recognition accuracy due to the so-called “channel mismatch” that occurs between the enrollment and recognition phases. A new online training method is proposed to improve robustness of speaker recognition in noisy conditions. An estimate of the signal-to-noise ratio and an emulated ambient noise spectral profile found in the silence intervals of the speech signal are used to re-enroll the reference model for a claimed speaker to generate a new noisy reference model. Based on a large number of tests using two datasets for speech samples contaminated with cafeteria babble and street noise, the proposed method shows promising improvement. When the signal-to-noise ratio is higher than 20 dB, typical speaker recognition algorithms normally function well, and the use of the proposed online training does not offer any benefit. When the signal-to-noise ratio is below 15 dB, the proposed method improves robustness of recognition. However, the new method shows limitations with speech samples that have been contaminated with interior train noise. Train noise contains slow time-varying components that require prolonged observation to create a reliable estimate.
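
As a rough illustration of the signal-to-noise estimate mentioned above, the following minimal sketch measures noise power in frames already labeled as silence and subtracts it from the power of the speech frames. The function names, the frame representation (lists of samples), and the power-subtraction rule are assumptions made for illustration; the abstract does not specify how the authors compute the estimate.

```python
import math

def mean_power(frames):
    """Average squared sample value over a list of frames (lists of floats)."""
    total = sum(x * x for frame in frames for x in frame)
    count = sum(len(frame) for frame in frames)
    return total / count

def estimate_snr_db(speech_frames, silence_frames):
    """Hypothetical SNR estimate: noise power is taken from silence frames
    and subtracted from the power of the (noisy) speech frames."""
    noise_power = mean_power(silence_frames)
    noisy_speech_power = mean_power(speech_frames)
    # Guard against a non-positive difference when the noise dominates.
    signal_power = max(noisy_speech_power - noise_power, 1e-12)
    return 10.0 * math.log10(signal_power / noise_power)

# Toy example: speech frames with mean power 101, silence frames with power 1.
snr_db = estimate_snr_db([[11.0, 9.0]], [[1.0, -1.0]])
```

In this toy case the estimate falls at the 20 dB level, the region where the abstract reports that online training offers no benefit.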

Dynamic Audio Reproduction with Linear Loudspeaker Arrays

Authors: Gálvez, Marcos F. Simón; Menzies, Dylan; Fazi, Filippo Maria

This report describes a dynamic, listener-adaptive, audio reproduction filtering system that allows a listener to move freely in front of a loudspeaker array without degrading perception. The proposed filters allow the delivery of two independent personalized signals to a pair of listeners or a binaural signal to a single listener. The filters are modified in real time according to the listener position. This is obtained by expressing the impulse response of each filter as a network of variable gain-delay elements that are modified so that the filters adapt the reproduction to the listener position, assuming that each loudspeaker behaves as a point-source free-field monopole. The paper introduces the filter formulation together with the signal processing scheme for a real-time implementation and measured performance for a listener-adaptive 28-loudspeaker linear array using an optical head-tracking system. The filter that controls the equalization of the acoustic pressure at the listener’s ears is an FIR filter that can be adjusted in real time. The measured personal audio performance shows that it is possible to obtain about 30 dB of crosstalk cancellation for two listeners between 300 Hz and 3 kHz. For binaural audio reproduction, the crosstalk cancellation between both of the listener’s ears is above 10 dB at 1 kHz and grows to 20 dB at 8 kHz.
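
The point-source, free-field monopole assumption means each loudspeaker's contribution at a listener position reduces to a distance-dependent gain and delay. The sketch below computes those two quantities for a linear array on the x-axis. It illustrates only the propagation model, not the authors' filter design; the function name, the coordinate convention, and the assumed speed of sound of 343 m/s are not from the source.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature

def monopole_gain_delay(speaker_xs, listener_xy):
    """Return a (gain, delay_seconds) pair for each loudspeaker.

    Loudspeakers are assumed to lie on the x-axis at positions speaker_xs
    (metres); listener_xy is the (x, y) position of the control point.
    The free-field monopole model gives gain 1/(4*pi*r) and delay r/c.
    """
    lx, ly = listener_xy
    result = []
    for sx in speaker_xs:
        r = math.hypot(lx - sx, ly)  # distance from loudspeaker to listener
        result.append((1.0 / (4.0 * math.pi * r), r / SPEED_OF_SOUND))
    return result

# Recompute whenever the head tracker reports a new listener position.
elements = monopole_gain_delay([-0.5, 0.0, 0.5], (0.0, 1.0))
```

Under this model, a head-tracking update only requires recomputing the gain and delay pairs, which is one way to read the abstract's description of filters built from variable gain-delay elements.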

Automating Mixing of User-Generated Audio Recordings from the Same Event

Authors: Stefanakis, Nikolaos; Mastorakis, Yannis; Alexandridis, Anastasios; Mouchtaris, Athanasios

When users attend the same public event, there may be multiple audiovisual recordings that are then posted on social media and websites. The availability of such a massive amount of user-generated recordings (UGR) has triggered new research directions related to the search, organization, and management of this content, and it has provided inspiration for new business models for content storage, retrieval, and consumption. The authors propose an approach to combine the available recordings based on a normalization step and a mixing step. The normalization step defines a fixed-with-time gain that is specific to each UGR. In the mixing step, a mechanism that reduces the master gain in accordance with the number of activated inputs at each time is employed. An approach called orthogonal mixing is presented, which is based on the assumption that the mixture components are mutually independent. The presented mixing process allows the combination of multiple short-duration UGRs to produce a longer audio stream, with potentially better quality than any one of its constituent parts. This property is exploited in the design of an automatic mixing process that uses all the available audio recordings at each moment.
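
The idea of reducing the master gain with the number of active inputs can be sketched as follows. If the inputs are mutually independent with equal power after normalization, the power of their sum grows in proportion to the number of inputs, so scaling by one over the square root of that number keeps the output power roughly constant. This scaling rule and the function name are assumptions chosen for illustration; the abstract does not give the authors' actual gain law.

```python
import math

def mix_sample(samples, active):
    """Mix one time instant of several normalized recordings.

    samples: one sample value per recording.
    active:  booleans marking which recordings are available at this instant.
    The 1/sqrt(N) master gain is a hypothetical choice that follows from
    assuming independent, equal-power inputs.
    """
    n_active = sum(1 for a in active if a)
    if n_active == 0:
        return 0.0  # no recording covers this instant
    master_gain = 1.0 / math.sqrt(n_active)
    return master_gain * sum(s for s, a in zip(samples, active) if a)

# Two of three recordings are active; the inactive one is ignored.
mixed = mix_sample([1.0, 1.0, 5.0], [True, True, False])
```

With a single active recording the master gain is one, so the output follows that recording unchanged.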

Engineering reports

An identification method has been proposed to obtain the thermal parameters of thermal models. For most loudspeakers, eddy current can be neglected in the low-frequency range and forced convection can be neglected in the high-frequency range. Therefore, the proposed method selects a partition frequency to neglect both of these factors and divides the frequency range into two parts. The linear parameters are directly obtained at the partition frequency without the influence of forced convection and eddy current, making the method practical. As the linear parameters are obtained at the partition frequency, the selection of the partition frequency may cause deviations. Forced convection and eddy current are identified in the low- and high-frequency ranges, respectively. All thermal parameters are identified by employing the proposed method by measuring and fitting the temperature curves of the voice coil at several single frequencies. The temperature curves of single-tone, two-tone, and white noise signals are measured and compared with the predicted curves according to identified parameters. The results show that the curves predicted by the proposed method agree well with the measured curves, demonstrating the validity and accuracy of the method.

Standards and Information Documents

AES Standards Committee News


[Feature] Prominent among the themes of papers on audio perception presented at the 145th AES Convention was the topic of loudness in music listening, including the effects of hyper-compression, and whether one can reduce listening levels without people noticing. Another important topic was the growing area of virtual reality and mobile listening, and the ways in which auditory and visual stimuli interact.

Audio Forensics Conference Preview, Porto


AES Conventions and Conferences


Table of Contents

Cover & Sustaining Members List

AES Officers, Committees, Offices & Journal Staff
