Only AES members and Institutional Journal Subscribers can download.
Authors: Liu, Yunyi; Jin, Craig
Generating sound effects with controllable variations is a challenging task, traditionally addressed using sophisticated physical models that require in-depth knowledge of signal processing parameters and algorithms. In the era of generative and large language models, text has emerged as a common, human-interpretable interface for controlling sound synthesis. However, the discrete and qualitative nature of language tokens makes it difficult to capture subtle timbral variations across different sounds. In this research, the authors propose a novel similarity-based conditioning method for sound synthesis, leveraging differentiable digital signal processing. This approach combines a latent space for learning and controlling audio timbre with an intuitive guiding vector, normalized to the range [0, 1], that encodes categorical acoustic information. By utilizing pretrained audio representation models, this method achieves expressive and fine-grained timbre control. To benchmark their approach, the authors introduce two sound effect datasets—Footstep-set and Impact-set—designed to evaluate both controllability and sound quality. Regression analysis demonstrates that the proposed similarity score effectively controls timbre variations and enables creative applications such as timbre interpolation between discrete classes. This work provides a robust and versatile framework for sound effect synthesis, bridging the gap between traditional signal processing and modern machine learning techniques.
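The similarity-based conditioning described above can be illustrated with a minimal numerical sketch. This is not the authors' implementation: the 4-D vectors below are toy stand-ins for pretrained audio embeddings, and mapping cosine similarity into [0, 1] is one plausible choice of normalization, assumed here for illustration.

```python
import numpy as np

def similarity_score(embedding, prototype):
    """Cosine similarity mapped from [-1, 1] to the range [0, 1]."""
    cos = np.dot(embedding, prototype) / (
        np.linalg.norm(embedding) * np.linalg.norm(prototype))
    return 0.5 * (cos + 1.0)

# Toy 4-D "embeddings" standing in for pretrained audio representations
# of two footstep classes (names are hypothetical).
grass = np.array([1.0, 0.0, 0.0, 0.0])
gravel = np.array([0.0, 1.0, 0.0, 0.0])
mix = 0.5 * (grass + gravel)  # interpolated timbre between two classes

print(similarity_score(grass, grass))            # → 1.0 (identical sound)
print(round(similarity_score(mix, grass), 3))    # → 0.854 (partway to gravel)
```

A score near 1 conditions the synthesizer toward one class; intermediate scores enable the timbre interpolation between discrete classes mentioned in the abstract.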
Authors: Švento, Michal; Moliner, Eloi; Juvela, Lauri; Wright, Alec; Välimäki, Vesa
The restoration of nonlinearly distorted audio signals, alongside the identification of the applied memoryless nonlinear operation, is studied. The paper focuses on the difficult but practically important case in which both the nonlinearity and the original input signal are unknown. The proposed method uses a generative diffusion model trained unconditionally on guitar or speech signals to jointly model and invert the nonlinear system at inference time. Both the memoryless nonlinear function model and the restored audio signal are obtained as output. Examples of successful blind estimation of hard- and soft-clipping, digital quantization, half-wave rectification, and wavefolding nonlinearities are presented. The results suggest that, out of the nonlinear functions tested here, the cubic Catmull-Rom spline is best suited to approximating these nonlinearities. In the case of guitar recordings, comparisons with informed and supervised restoration methods show that the proposed blind method matches or exceeds them in terms of objective metrics. Experiments on distorted speech show that the proposed blind method outperforms general-purpose speech enhancement techniques and restores the original voice quality. The proposed method can be applied to memoryless audio effects modeling, restoration of music and speech recordings, and characterization of analog recording media.
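The finding that a cubic Catmull-Rom spline approximates memoryless nonlinearities well can be illustrated with a small sketch. The knot spacing and the tanh soft-clipping target below are illustrative assumptions, not the paper's setup; the sketch only shows how a Catmull-Rom segment parameterizes a memoryless transfer curve through a set of control points.

```python
import numpy as np

def catmull_rom(p0, p1, p2, p3, t):
    """Evaluate a uniform cubic Catmull-Rom segment between p1 and p2, t in [0, 1]."""
    return 0.5 * (
        2.0 * p1
        + (-p0 + p2) * t
        + (2.0 * p0 - 5.0 * p1 + 4.0 * p2 - p3) * t * t
        + (-p0 + 3.0 * p1 - 3.0 * p2 + p3) * t ** 3
    )

def spline_nonlinearity(x, knots_x, knots_y):
    """Apply a memoryless nonlinearity defined by Catmull-Rom control points."""
    i = np.clip(np.searchsorted(knots_x, x) - 1, 1, len(knots_x) - 3)
    t = (x - knots_x[i]) / (knots_x[i + 1] - knots_x[i])
    return catmull_rom(knots_y[i - 1], knots_y[i], knots_y[i + 1], knots_y[i + 2], t)

# Control points sampled from a tanh soft clipper; the spline passes
# through each interior knot exactly (interpolation property).
kx = np.linspace(-1.5, 1.5, 9)
ky = np.tanh(2.0 * kx)
print(spline_nonlinearity(0.0, kx, ky))  # ≈ tanh(0) = 0
```

Because the spline interpolates its control points, estimating the nonlinearity reduces to estimating a handful of knot values, which is what makes it convenient to fit blindly.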
Authors: Hu, Xuyi; Li, Jian; Picinali, Lorenzo; Hogg, Aidan O. T.
Accurate head-related transfer functions (HRTFs) are essential for delivering realistic 3D audio experiences. However, obtaining personalized, high-resolution HRTFs for individual users is a time-consuming and costly process, typically requiring extensive acoustic measurements. To address this, spatial upsampling techniques have been developed to estimate high-resolution HRTFs from sparse, low-resolution acoustic measurements. This paper presents a novel approach that leverages the spherical harmonic domain and an autoencoder generative adversarial network to tackle the HRTF upsampling problem. Comprehensive evaluations are conducted using both perceptual models and objective spectral metrics to validate the accuracy and realism of the upsampled HRTFs. The results show that the proposed approach outperforms traditional barycentric interpolation in terms of log-spectral distortion, particularly in extreme sparsity scenarios involving fewer than 12 measurements. These results indicate that the proposed autoencoder generative adversarial network can create high-quality, high-resolution HRTFs from only a few acoustic measurements, helping pave the way for more accessible personalized spatial audio across a range of applications.
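As background to the spherical harmonic domain mentioned above, spatial functions on the sphere can be fitted from sparse directions by least squares on an SH basis and then evaluated anywhere. The sketch below uses only order-1 real spherical harmonics and a toy directional pattern; it is a baseline illustration of SH interpolation, not the paper's autoencoder generative adversarial network.

```python
import numpy as np

def sh_basis(dirs):
    """Real spherical harmonics up to order 1 for unit vectors of shape (n, 3)."""
    x, y, z = dirs.T
    c0 = np.sqrt(1.0 / (4.0 * np.pi))   # Y_0^0
    c1 = np.sqrt(3.0 / (4.0 * np.pi))   # Y_1^{-1}, Y_1^0, Y_1^1
    return np.stack([np.full_like(x, c0), c1 * y, c1 * z, c1 * x], axis=1)

def sh_interpolate(meas_dirs, meas_vals, query_dirs):
    """Least-squares SH fit to sparse measurements, evaluated on new directions."""
    coeffs, *_ = np.linalg.lstsq(sh_basis(meas_dirs), meas_vals, rcond=None)
    return sh_basis(query_dirs) @ coeffs

# Toy "HRTF magnitude" that is itself order-1: 1 + 0.5 * z (up/down tilt).
dirs = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, -1]], float)
vals = 1.0 + 0.5 * dirs[:, 2]
query = np.array([[0.0, 0.0, 1.0]])
print(sh_interpolate(dirs, vals, query))  # ≈ [1.5]
```

Real HRTF upsampling needs far higher SH orders per frequency bin, which is exactly where sparse measurements underdetermine the fit and learned priors such as the proposed network become useful.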
Authors: Dourou, Nefeli A.; Bruschi, Valeria; Cecchi, Stefania
Binaural audio rendering is a technique used to create immersive sound experiences by simulating the way humans perceive spatial audio. Typically performed using headphones, binaural reproduction is widely applied in virtual reality, gaming, and 3D audio experiences, enhancing realism and engagement. However, headphones may not always be convenient, leading to the use of loudspeakers for binaural playback. In a loudspeaker setup, the crosstalk effect occurs when audio intended for one ear is heard by the other, distorting spatial perception. To address this, crosstalk cancellation algorithms are applied to ensure accurate binaural audio delivery. One promising application is binaural audio headrest systems for vehicles, enhancing audio experiences during travel. This paper examines the impact of loudspeaker positioning in such a system, applying a crosstalk cancellation algorithm and evaluating loudspeaker placements and configurations, which are also critical for effective cancellation. Subjective and objective analyses demonstrate that crosstalk cancellation is necessary to improve the quality of a binaural audio headrest, which may well represent the listening space of the future.
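For context, crosstalk cancellation is commonly formulated as inverting, at each frequency, the 2×2 matrix of acoustic paths from the two loudspeakers to the two ears. The sketch below uses a toy, frequency-flat plant matrix, and the regularization term `beta` is a common practical safeguard against ill-conditioned inversion; neither is taken from this paper.

```python
import numpy as np

def crosstalk_canceller(H, beta=1e-3):
    """Per-frequency regularized inversion of the 2x2 plant matrix.

    H has shape (n_bins, 2, 2): H[k, ear, speaker] is the acoustic path at
    frequency bin k. Returns filters C such that H @ C ≈ I, so each ear
    receives only its intended binaural signal."""
    I = np.eye(2)
    Hh = np.conj(np.swapaxes(H, -1, -2))  # Hermitian transpose per bin
    return Hh @ np.linalg.inv(H @ Hh + beta * I)

# Toy plant: direct paths of gain 1, crosstalk paths of gain 0.5,
# identical in all 4 frequency bins.
H = np.tile(np.array([[1.0, 0.5], [0.5, 1.0]], dtype=complex), (4, 1, 1))
C = crosstalk_canceller(H, beta=0.0)
print(np.round(np.abs((H @ C)[0]), 3))  # ≈ identity: crosstalk suppressed
```

Loudspeaker placement matters precisely because it shapes H: nearly symmetric, strongly coupled paths make the matrix ill-conditioned and the inversion fragile.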
Authors: Combes, Paolo; Weinzierl, Stefan; Obermayer, Klaus
Deep learning appears as an appealing solution for automatic synthesizer programming (ASP), which aims to assist musicians and sound designers in programming sound synthesizers. However, integrating software synthesizers into training pipelines is challenging due to their potential nondifferentiability. This work tackles this challenge by introducing a method to approximate arbitrary synthesizers. Specifically, a neural network is trained to map synthesizer presets onto an audio embedding space derived from a pretrained model. This facilitates the definition of a neural proxy that produces compact yet effective representations, thereby enabling the integration of audio embedding loss into neural-based ASP systems for black-box synthesizers. The authors evaluate the representations derived by various pretrained audio models in the context of neural-based methods for ASP and assess the effectiveness of several neural network architectures, including feedforward, recurrent, and transformer-based models, in defining neural proxies. The proposed method is evaluated using both synthetic and handcrafted presets from three popular software synthesizers, and its performance is assessed in a synthesizer sound-matching downstream task. Although the benefits of the learned representation are nuanced by resource requirements, encouraging results were obtained for all synthesizers, paving the way for future research into the application of synthesizer proxies for neural-based ASP systems.
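The proxy idea can be sketched in drastically simplified form: learn a map from preset parameters to audio embeddings, then evaluate an embedding loss without ever calling the nondifferentiable synthesizer. Here the "proxy" is a linear least-squares fit on synthetic data; the paper uses feedforward, recurrent, and transformer networks with embeddings from real pretrained audio models, and all dimensions below are made-up toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 presets with 8 parameters, each paired with the
# 16-D embedding of the rendered sound from a pretrained audio model.
# Here the "synthesizer + embedding" pipeline is faked as a linear map.
presets = rng.uniform(0.0, 1.0, (200, 8))
true_map = rng.normal(size=(8, 16))
embeddings = presets @ true_map + 0.01 * rng.normal(size=(200, 16))

# Linear stand-in for the neural proxy, fit by least squares.
W, *_ = np.linalg.lstsq(presets, embeddings, rcond=None)

# Audio-embedding loss for a candidate preset against a target embedding,
# computed from the proxy alone (no synthesizer call, fully differentiable).
target = presets[0] @ true_map
loss = np.mean((presets[0] @ W - target) ** 2)
print(loss)
```

Once such a proxy exists, an ASP system can backpropagate through it to search preset space, which is the integration the abstract describes for black-box synthesizers.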
Authors: Naal-Ruiz, Norberto E.; Ibarra-Zarate, David I.; Alonso-Valerdi, Luz M.
Increasing research into extended reality technologies has driven significant progress in developing virtual auditory experiments. These efforts aim to create immersive virtual environments, yet challenges such as room divergence—mismatches between virtual and physical auditory spaces—remain critical. This study focused on calibrating a binaural-rendered listening room to match the acoustic parameters of a physical neuroengineering laboratory. Acoustic calibration of a binaural renderer plugin reduced imprecision across the measured parameters, confirming the feasibility of software adjustments to replicate the behavior of physical spaces in extended reality environments. Subjective evaluations by 13 expert listeners revealed a trade-off between timbral and spatial quality, with one condition showing the least degradation in spatial quality while having the worst timbral quality score. The findings emphasize the importance of calibration, validation, and further research to enhance the performance of binaural renderers. This work provides a foundation for using auditory extended reality tools in multidisciplinary research, particularly neuroacoustics, enabling real-time collection of electrophysiological data in controlled, virtual environments.