Journal of the Audio Engineering Society

2016 November - Volume 64 Number 11


Low Bit-Rate Speech Codec Based on a Long-Term Harmonic Plus Noise Model

Authors: Ben Ali, Faten; Djaziri-Larbi, Sonia; Girin, Laurent

In speech/music coders and analysis/synthesis systems, spectral modeling is generally performed on a short-term (ST) frame-by-frame basis, which is justified by the fact that the signal is only locally (quasi-)stationary. The vocal tract configuration moves slowly and smoothly, resulting in a high correlation between the spectral parameters of successive frames. This correlation is exploited in long-term modeling of the ST parameters, which, however, results in longer modeling/coding delays. The short delay constraint can be relaxed in many applications, such as text-to-speech modification/synthesis, telephony surveillance data, digital answering machines, electronic voicemail, digital voice logging, electronic toys, and video games. The long-term harmonic plus noise model (LT-HNM) for speech offers additional data compression possibilities since it exploits the smooth evolution of the time trajectories of the short-term harmonic plus noise model parameters by applying a discrete cosine model (DCM). In this paper, the authors extend the LT-HNM to a complete low bit-rate speech coder that is based on a long-term approach (ca. 200 ms). The proposed LT-HNM coder reaches a bit rate of 2.7 kbps for wideband speech.
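The idea of compressing a smooth parameter trajectory with a few cosine coefficients can be illustrated with a minimal sketch. It is not the authors' DCM fitting procedure: it uses a plain truncated orthonormal DCT-II, and the function names, the 20-frame length, and the example fundamental-frequency values are all assumptions made for illustration.

```python
import math


def dct_fit(trajectory, num_coeffs):
    """Return the first num_coeffs orthonormal DCT-II coefficients of a trajectory."""
    n = len(trajectory)
    coeffs = []
    for k in range(num_coeffs):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                  for i, x in enumerate(trajectory))
        coeffs.append(scale * acc)
    return coeffs


def dct_reconstruct(coeffs, n):
    """Rebuild an n-point trajectory from (possibly truncated) DCT-II coefficients."""
    out = []
    for i in range(n):
        value = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            value += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(value)
    return out


def squared_error(a, b):
    """Sum of squared differences between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical smooth trajectory: 20 short-term frames of a slowly varying
# fundamental frequency in Hz (values invented for this example).
f0_track = [120.0 + 15.0 * math.sin(math.pi * i / 19) for i in range(20)]

# Keep only 5 of the 20 coefficients, then rebuild the trajectory from them.
compact = dct_fit(f0_track, 5)
approx = dct_reconstruct(compact, len(f0_track))
```

Because the trajectory varies slowly, most of its energy falls in the lowest-order coefficients, so a handful of values can stand in for the full set of per-frame parameters. That is the kind of redundancy a long-term model can exploit.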


Musical instrument sounds have distinct timbral and emotional characteristics that can change when audio processing is applied. This paper investigates the effects of MP3 compression on the emotional characteristics of eight sustained instrument sounds using listening tests. The experimental paradigm involved a pairwise comparison of compressed and uncompressed samples at several bit rates over ten emotional categories. The results showed that MP3 compression strengthened neutral and negative emotional characteristics such as Mysterious, Shy, Scary, and Sad, and weakened positive emotional characteristics such as Happy, Heroic, Romantic, Comic, and Calm. Angry was relatively unaffected by MP3 compression, probably because the background “growl” artifacts added by MP3 compression decreased positive emotional characteristics and increased negative characteristics such as Mysterious and Scary. Compression affected some instruments more and others less; the trumpet was the most affected and the horn the least.

A Fifty-Node Lebedev Grid And Its Applications To Ambisonics

Authors: Lecomte, Pierre; Gauthier, Philippe-Aubert; Langrenne, Christophe; Berry, Alain; Garcia, Alexandre

Physical reconstruction or synthesis of three-dimensional sound fields can be implemented with Near Field Compensated Higher Order Ambisonics. This paper investigates the use of a fifty-node Lebedev grid, which is derived from rotationally-invariant quadrature rules. Special attention is paid to spatial aliasing artifacts at the capture and reproduction steps. In a comparison of the fifty-node Lebedev grid with a Fliege grid and a t-design grid that both use almost the same number of nodes, the Lebedev grid is shown to provide the best performance in terms of sound field capture and reproduction. Finally, a multiband multiorder decoder is presented. Such decoders take advantage of the inherent nested subgrids that result from the rotationally-invariant quadrature approach. The importance of orthonormality of the spherical harmonics is highlighted in the context of physical encoding or reconstruction of a sound field with the Ambisonics approach. Simulation results are provided for the case of a three-band decoder using the three grids contained in the Lebedev grid. It was found that a multifrequency sound field can be reproduced accurately in the sweet spot by using a combination of a low-order decoder for low frequencies and a higher-order decoder for higher frequencies.

To enhance the localization performance of listeners using head-related impulse response (HRIR) datasets from dummy heads, individualization was added. An ellipsoidal model is used to adapt the Interaural Time Difference (ITD) of the dataset to individual subjects by using their anthropometric data. Head measurements from 23 subjects were used to validate the model. The ITD model is based on an ellipsoid shape and the analytical solution of the sound transmission around a sphere. A comparison of the measured and adapted ITDs shows that the average absolute error was 25 ± 9 µs, which is a value below the just-noticeable difference. However, the ellipsoidal model underestimates the ITD. In contrast to similar approaches, this model calculates both azimuth- and elevation-dependent ITDs. Since the ITD of a given HRTF dataset is individualized, the shoulder reflections and the ear offset are maintained.
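As a rough illustration of how head size maps to ITD, the following sketch uses Woodworth's classical spherical-head approximation, ITD = (a/c)(θ + sin θ), for a far-field source in the horizontal plane. This is a simpler stand-in and not the ellipsoidal model of the paper, which also covers elevation. The function name and the default head radius and speed of sound are assumptions chosen for the example.

```python
import math


def woodworth_itd(azimuth_rad, head_radius_m=0.0875, speed_of_sound=343.0):
    """Approximate ITD in seconds for a rigid spherical head (Woodworth formula).

    azimuth_rad is the source azimuth from straight ahead, limited to
    [-pi/2, pi/2]; the sign of the result follows the sign of the azimuth.
    The default radius and speed of sound are illustrative assumptions.
    """
    theta = abs(azimuth_rad)
    if theta > math.pi / 2:
        raise ValueError("azimuth must lie within [-pi/2, pi/2]")
    itd = head_radius_m / speed_of_sound * (theta + math.sin(theta))
    return math.copysign(itd, azimuth_rad)


# Hypothetical individualization step: the same source direction evaluated
# for a smaller and a larger head radius (values invented for this example).
itd_small = woodworth_itd(math.radians(60.0), head_radius_m=0.080)
itd_large = woodworth_itd(math.radians(60.0), head_radius_m=0.095)
```

In this approximation the ITD scales linearly with the head radius, which shows why adapting a dummy-head dataset to a listener's anthropometric data changes the time cue.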

Low-Complexity Simultaneous Estimation of Head-Related Transfer Functions by Prediction Error Method

Authors: Kanai, Sekitoshi; Sugaya, Maho; Adachi, Shuichi; Matsui, Kentaro

This research explores a fast measurement method for computing a head-related transfer function (HRTF). The method uses a multidirectional intermediate directional transfer function (IDTF) with a multiple-input single-output structure. An efficient procedure is then used to calculate the model parameters. Experiments showed that a simultaneous estimation method made it possible to estimate IDTFs on the horizontal plane as accurately as those measured one by one in the frequency range from 375 to 19,875 Hz. Even though the IDTFs in the directions contralateral to each ear are difficult to estimate because of low signal-to-noise ratio, the estimated IDTFs preserved the spectral cues. The effectiveness of the proposed method was verified through a simultaneous estimation experiment on a set of IDTFs of 24 directions measured using a dummy head. In this experiment, the intermediate directional impulse responses were approximated by the 128-order FIR models. Through the experiments, it was confirmed that the average spectral distortion between the simultaneously estimated IDTFs and IDTFs measured one direction at a time was less than 1 dB in the frequency range from 375 to 19,875 Hz. The method can also be applied to room impulse responses.
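A spectral distortion figure of this kind is commonly computed as the root-mean-square difference between two magnitude responses in dB. The sketch below shows that common definition, which may differ in detail from the one used in the paper. The function name and the example magnitudes are assumptions made for illustration.

```python
import math


def spectral_distortion_db(mag_reference, mag_estimate):
    """RMS difference in dB between two magnitude responses.

    Both inputs are sequences of strictly positive linear magnitudes
    sampled at the same frequency bins.
    """
    if len(mag_reference) != len(mag_estimate) or not mag_reference:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for ref, est in zip(mag_reference, mag_estimate):
        diff_db = 20.0 * math.log10(est / ref)
        total += diff_db ** 2
    return math.sqrt(total / len(mag_reference))


# Hypothetical magnitudes at four frequency bins (values invented here).
measured = [1.00, 0.80, 0.50, 0.25]
estimated = [1.05, 0.78, 0.52, 0.24]
sd = spectral_distortion_db(measured, estimated)
```

Restricting the input bins to a band such as 375 to 19,875 Hz gives a band-limited figure comparable in spirit to the one quoted above.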

Compact loudspeaker arrays can be driven so that the radiation patterns produce zones of private sound and zones of silence. This research compares the performance of two strategies, both based on the Pressure Matching Method (PMM), for accurate reproduction of a target signal. The first strategy is the Weighted PMM (WPMM) with low values for the weight of the reproduction error in the zone where accurate reproduction is not targeted. The second strategy is the Linearly-Constrained PMM (LCPMM), wherein a performance constraint on the accuracy of the target signal in a given zone is added to the cost function for the calculation of the input signals. Performance of the two methods was evaluated using numerical simulations of monopoles in free field and a linear array prototype with measured transfer functions in an anechoic environment. The two strategies were evaluated using a target signal with large amplitude variations between the so-called acoustically bright and dark zones. Results show that input signals designed with the WPMM provide better trade-offs between accuracy of the target field reproduction in the bright zone and directivity performance compared to that of the LCPMM.


Previous research has shown that both sustained and nonsustained musical instrument sounds have strong emotional characteristics. This report explores how the effects of pitch and dynamics influence the emotional characteristics of isolated one-second piano sounds. Listeners compared the sounds pairwise over ten emotion categories. The results showed that all ten emotional categories were significantly affected by pitch and nine of them by dynamics. In particular, the emotional characteristics Happy, Romantic, Comic, Calm, Mysterious, and Shy generally increased with pitch, but sometimes decreased at the highest pitches. The characteristics Heroic, Angry, and Sad generally decreased with pitch. Scary was strong in the extreme low and high registers. With regard to dynamics, the results showed that the characteristics Heroic, Comic, Angry, and Scary were stronger for loud notes, while Romantic, Calm, Mysterious, Shy, and Sad were stronger for soft notes. Surprisingly, Happy was not affected by dynamics. These results help quantify the emotional characteristics of piano sounds.


[Feature] Headphones are almost certainly now the dominant means by which many people listen to reproduced sound. Still, the quality of the devices used is often remarkably low, and they exhibit a very wide range of frequency responses. The need for a new “preferred” target curve has been suggested. Loudspeaker reproduction can now be simulated on headphones in such a way that timbral coloration is minimized. Virtual room simulation, head tracking, and personalization could make this even more successful. There is continued debate about whether consumer preference trumps objective accuracy.

2016 Audio for Virtual and Augmented Reality Conference Report, Los Angeles

3rd International Conference on Sound Reinforcement - Open Air Venues, Call for Contributions, Struer


AES Conventions and Conferences


Table of Contents

Cover & Sustaining Members List

AES Officers, Committees, Offices & Journal Staff
