Authors: Treybig, Lukas; Klein, Florian; Werner, Stephan; Amengual Garí, Sebastià V.; Göring, Steve
Acoustic augmented realities aim to incorporate virtual elements on top of real acoustic environments. A key question is how accurately virtual acoustics must match real acoustics to enable perceptually seamless integration. In six-degrees-of-freedom scenarios, where listeners can explore spaces, intra-room acoustic variation is of particular interest. This paper presents a method for modeling the perceptibility of acoustic changes in an acoustic augmented reality context, using a generalized linear mixed model based on differences in room acoustic parameters computed from pairs of room impulse responses. An online listening test based on extensive acoustic measurements investigated perceptible differences between three acoustic configurations of the same room using identical source-receiver arrangements. A total of 4,753 ratings from 120 participants in an ABX-like format (a method where participants identify whether a presented stimulus X matches stimulus A or B) were used to train the generalized linear mixed model. The resulting model shows good predictive performance (Brier score: 0.19; AUC: 0.67) based on fivefold cross-validation with 100 repetitions, although improvements are possible. However, the perceptibility of acoustic changes can vary significantly between dynamic and static listening scenarios. The paper discusses evaluation strategies for these aspects, examines the model's generalizability, highlights current limitations, and outlines directions for future improvement.
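As a concrete illustration of the modeling pipeline, the following minimal Python sketch reproduces the evaluation protocol (fivefold cross-validation with repeated splits, Brier score, and AUC) on synthetic data. A plain logistic regression stands in for the paper's generalized linear mixed model, so the random effects (e.g., per participant) are omitted, and the three predictors are hypothetical room acoustic parameter differences, not the authors' feature set.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import RepeatedStratifiedKFold

rng = np.random.default_rng(0)
n = 4753  # number of ABX ratings, as in the study
# Hypothetical predictors: per-trial differences in three room acoustic
# parameters (e.g., early decay time, clarity, direct-to-reverberant ratio).
X = rng.normal(size=(n, 3))
# Synthetic ABX outcomes (1 = listener identified X correctly).
p = 1.0 / (1.0 + np.exp(-(0.2 + X @ np.array([0.8, 0.5, 0.3]))))
y = (rng.random(n) < p).astype(int)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=100, random_state=0)
brier, auc = [], []
for train, test in cv.split(X, y):
    clf = LogisticRegression().fit(X[train], y[train])
    prob = clf.predict_proba(X[test])[:, 1]
    brier.append(brier_score_loss(y[test], prob))
    auc.append(roc_auc_score(y[test], prob))
print(f"Brier: {np.mean(brier):.2f}  AUC: {np.mean(auc):.2f}")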
Authors: Biberger, Thomas; Fleßner, Jan-Hendrik; Ewert, Stephan D.
Intrusive audio quality models typically compare "internal representations" of a reference and a test signal. These models are often optimized for the prediction of small signal degradations, where the test and reference signals are still highly correlated (waveform-preserving distortions). However, differences between uncorrelated signals, such as two Gaussian-noise tokens, or between more complex, realistic signals in spatial audio reproduction schemes that are only partially correlated (non-waveform-preserving distortions), are not necessarily easy for listeners to distinguish. Despite this, current audio quality models typically predict large perceptual differences between such signals. Here, the decision back-end of a reference-based audio quality model was modified to account for this overestimation of signal quality differences. The suggested modifications were intended to effectively mimic short-term memory limitations by analyzing similarities in the differences between the internal representations of reference and test signals across time frames, auditory channels, and modulation channels. The modified model was evaluated with data based on different audio reproduction and room simulation methods and was compared to other state-of-the-art audio quality models. Results support the need for modifications of state-of-the-art audio quality models to accurately predict the perceptual effects of non-waveform-preserving distortions.
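One way to picture the proposed back-end modification is the sketch below: it measures how similar the reference-test difference pattern is across time frames and down-weights differences that are consistent over time. This is a guess at the mechanism under stated assumptions, not the authors' implementation, and the final weighting rule is a hypothetical stand-in.

import numpy as np

def adjusted_difference(rep_ref, rep_test):
    # rep_*: internal representations, shape (frames, features), where the
    # feature axis flattens auditory and modulation channels.
    diff = rep_test - rep_ref                   # per-frame difference
    norm = np.linalg.norm(diff, axis=1, keepdims=True) + 1e-12
    unit = diff / norm                          # direction of each frame's difference
    n = unit.shape[0]
    if n < 2:
        return norm.mean()
    # Mean pairwise cosine similarity of difference patterns across frames.
    sim = unit @ unit.T
    mean_sim = (sim.sum() - n) / (n * (n - 1))
    raw = norm.mean()                           # conventional distance measure
    # Differences that repeat across frames are treated as less salient,
    # loosely mimicking short-term memory limitations.
    return raw * (1.0 - max(mean_sim, 0.0))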
Authors: Himmelein, Hendrik; Lübeck, Tim; Bau, David; Pörschmann, Christoph
Plausibility has become an established concept for evaluating the overall quality of audio rendering in extended reality. Several studies have assessed plausibility using measured binaural room impulse responses (BRIRs). However, BRIRs measured with an artificial head often suffer from limitations such as high measurement effort and limited individualization. Thus, the present study compares two alternative auralization approaches: an Ambisonics-based method and a parametric approach, using a measured BRIR condition as a reference. The results did not reveal significant differences in overall plausibility between the three conditions. Only slight differences were observed between source positions, which may be attributed to the limited head movement and specific room properties. These findings suggest that the choice of the auralization method has only a minor impact on perceived plausibility. Implications of these results for future research and auralization applications are discussed.
Authors: Diaz, Rodrigo; De La Vega Martin, Carlos; Sandler, Mark
Simulating the dynamics of strings and membranes is a central challenge in sound synthesis that is traditionally tackled with numerical methods like modal synthesis, which are efficient and accurate when the governing partial differential equations and parameters are known. However, when such information is unavailable, data-driven approaches become essential. Neural networks offer a viable alternative, with recurrent architectures providing real-time synthesis but often suffering from training instability and limited scalability. State-space models address these issues by combining linear dynamics with nonlinear transformations across sequences, enabling parallelization and greater efficiency. This paper introduces a data-driven framework for modeling vibrational dynamics, focusing on interpretability, efficiency, and accuracy. The proposed approach combines a linear time-invariant process with a neural modulation network. The linear time-invariant process captures global dynamics via optimized eigenvalues, allowing direct control over vibrational characteristics such as frequency and decay. The modulation network refines these dynamics by introducing local nonlinear adjustments, ensuring adaptability to complex behaviors. This architecture, therefore, unifies efficient learning, interpretable control, and accurate reconstruction of nonlinear vibrational dynamics.
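The division of labor between the LTI core and the modulation network can be illustrated with a toy Python sketch: complex eigenvalues encode modal frequency (their angle) and decay (their magnitude), while a small nonlinearity stands in for the learned modulation network. All names and constants are illustrative, not the paper's architecture or training setup.

import numpy as np

def ssm_synth(u, lam, c, mod_w):
    # Diagonal LTI state-space recurrence with a stand-in nonlinear stage.
    x = np.zeros(lam.shape, dtype=complex)
    out = np.empty(len(u))
    for t, ut in enumerate(u):
        x = lam * x + ut                        # linear time-invariant update
        y = np.real(c @ x)                      # linear readout of modal states
        out[t] = y + 0.01 * np.tanh(mod_w @ np.abs(x))  # local nonlinear correction
    return out

fs = 48_000
f = np.array([440.0, 660.0])                    # modal frequencies, Hz
alpha = np.array([6.0, 9.0])                    # decay rates, 1/s
lam = np.exp((-alpha + 2j * np.pi * f) / fs)    # angle sets pitch, magnitude sets decay
u = np.zeros(fs // 10); u[0] = 1.0              # impulse excitation
y = ssm_synth(u, lam, np.ones(2), np.array([0.5, -0.5]))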
Authors: Wirler, Stefan; Pulkki, Ville
This paper proposes a spatial post-filter for speech enhancement in reverberant environments, such as meeting rooms and lecture halls, using widely distributed microphones. The method operates in the time-frequency domain by aggregating phase-corrected cross-spectral similarities between microphone pairs to form a unified spatial filter. This filter can be applied to omnidirectional or beamformed signals to extract a target source from a mixture while reducing the influence of spatial aliasing. The approach is evaluated in simulated multitalker scenarios and validated through objective performance metrics and subjective listening tests under varied acoustic conditions. Results demonstrate robust suppression of interference and consistent enhancement of the desired source.
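A rough sketch of such a filter is given below: each microphone's STFT is phase-aligned to the hypothesized target position, the cross-spectra of all pairs are aggregated, and the resulting real-valued mask is high where the pairwise phases agree. Shapes, names, and the exact aggregation rule are assumptions, not the authors' formulation.

import numpy as np

def spatial_postfilter(X, delays, fs, nfft):
    # X: STFT tensor, shape (mics, frames, bins); delays: propagation delay
    # from the target position to each microphone, in seconds.
    m = X.shape[0]
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    # Phase-align every microphone to the target ("steering").
    steer = np.exp(2j * np.pi * freqs[None, None, :] * delays[:, None, None])
    Xs = X * steer
    num, den = 0.0, 0.0
    for i in range(m):
        for j in range(i + 1, m):
            cross = Xs[i] * np.conj(Xs[j])      # phase-corrected cross-spectrum
            num = num + np.real(cross)
            den = den + np.abs(cross)
    mask = np.clip(num / (den + 1e-12), 0.0, 1.0)  # high where phases agree
    return mask  # apply to an omnidirectional or beamformed STFT, then invert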
Authors: Kumar, Akash; Puri, Amrita
When synthesizing a dynamic sound field using a multichannel loudspeaker array, incorporating moving sound scenes enhances the immersive experience for listeners. Previous studies have explored the synthesis of a moving sound source situated outside the region enclosed by an array of loudspeakers, generally termed a nonfocused moving sound source. This paper presents the synthesis of a moving sound source inside the region enclosed by a loudspeaker array, termed a focused moving source. The driving signals for the loudspeakers are derived using the principle of wave field synthesis. Computer simulations have been conducted to evaluate the effectiveness of the proposed driving signals in synthesizing a focused moving point source.
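For reference, a textbook 2.5-dimensional WFS driving function for a static focused point source (e.g., after Spors et al.) can be sketched as follows; the paper's time-variant driving signals for a moving focused source are not reproduced here, and the secondary-source selection window is omitted for brevity.

import numpy as np

def wfs_focused_driving(x0, n0, xs, freqs, c=343.0, g0=1.0):
    # x0: loudspeaker positions, shape (L, 2); n0: unit normals pointing
    # into the listening area, shape (L, 2); xs: focus point, shape (2,).
    r_vec = xs[None, :] - x0                  # vectors from each loudspeaker to the focus
    r = np.linalg.norm(r_vec, axis=1)
    cos_ang = np.einsum("ij,ij->i", r_vec, n0) / r
    k = 2 * np.pi * freqs[:, None] / c        # wavenumbers, shape (F, 1)
    # Time-reversed (converging) monopole, hence the +jkr phase term.
    return (-g0 * np.sqrt(1j * k / (2 * np.pi))
            * (cos_ang / r**1.5)[None, :] * np.exp(1j * k * r[None, :]))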
Authors: Zhao, Yilin; Chen, Simiao; Pan, Keyu; Shen, Yong
The creep effect in the suspension system of microspeakers significantly influences their low-frequency displacement dynamics. Although theoretical models that account for the creep effect have been established, the lack of a corresponding parameter identification method has hindered their practical implementation in loudspeaker active control systems. This paper presents a comprehensive parameter identification framework for microspeakers incorporating the creep effect. The creep-related parameters R2 and K2 are efficiently estimated from the linear displacement transfer function, followed by time-domain identification of the remaining system parameters under nonlinear operating conditions. Experimental results show that, compared with models that neglect the creep effect, the proposed method markedly improves displacement prediction accuracy at low frequencies, highlighting its strong potential for practical applications.
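The first identification step can be sketched as a frequency-domain fit of a displacement transfer function that includes a creep branch. The sketch below assumes one common creep parameterization (a spring K2 in parallel with a damper R2, augmenting the main stiffness K1) and treats Bl, Re, and Mms as known; the paper's exact model, symbols, and procedure may differ.

import numpy as np
from scipy.optimize import least_squares

def creep_frf(params, w, Bl, Re, Mms):
    # Small-signal displacement-per-volt response with frequency-dependent
    # stiffness: K(s) rises from K1 (low frequency) toward K1 + K2.
    K1, Rms, K2, R2 = params
    s = 1j * w
    K_eff = K1 + (K2 * R2 * s) / (K2 + R2 * s)
    return (Bl / Re) / (Mms * s**2 + Rms * s + K_eff)

def identify_creep(w, X_meas, Bl, Re, Mms, p0=(1e3, 0.1, 5e2, 50.0)):
    # Fit real and imaginary parts of the measured displacement FRF X_meas.
    def resid(p):
        e = creep_frf(p, w, Bl, Re, Mms) - X_meas
        return np.concatenate([e.real, e.imag])
    return least_squares(resid, p0, bounds=(0.0, np.inf)).x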
Authors: Maher, Robert C.
This is a case study describing the rapid audio forensic assessment of recordings from the assassination attempt against presidential candidate Donald Trump in Butler, PA, on Saturday, July 13, 2024. Audio from the podium microphones revealed gunshot sound sequences of ballistic shock waves and muzzle blast sounds consistent with rifle bullets traveling at supersonic speed. Additional independent audio evidence was captured by cell phone videos recorded by individuals in the crowd at the time of the shooting. The acoustical consistency between the podium microphone signal and the audio from the numerous user-generated cell phone videos provided strong confidence in the recordings' authenticity. Within a few hours of the incident, audio forensic analysis identified ten audible gunshots: the first eight came from a single position, consistent with the subsequently identified location of the would-be assassin's body, and the other two were attributable to law enforcement officers.
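To illustrate the physics the analysis relies on: a supersonic bullet's shock wave reaches a downrange microphone before the muzzle blast, and the gap grows with shooter distance. The sketch below uses a deliberately simplified straight-line geometry with illustrative numbers; it is not the analysis performed in the paper, which must account for Mach-cone geometry and bullet deceleration.

# All numbers are hypothetical and for illustration only.
c = 343.0   # speed of sound, m/s
v = 900.0   # assumed supersonic bullet speed, m/s
dt = 0.25   # measured shock-to-muzzle-blast gap at the microphone, s
# The blast travels the distance d at c; the bullet "delivers" the shock
# at roughly v, so dt ~= d/c - d/v for a microphone directly downrange.
d = dt / (1.0 / c - 1.0 / v)
print(f"rough shooter-to-microphone distance: {d:.0f} m")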