Authors: Willemsen, Silvin; Bilbao, Stefan; Ducceschi, Michele; Serafin, Stefania
Several well-established approaches to physical modeling synthesis for musical instruments exist. Finite-difference time-domain methods are known for their generality and flexibility in terms of the systems one can model but are less flexible with regard to smooth parameter variations due to their reliance on a static grid. This paper presents the dynamic grid, a method to smoothly change grid configurations of finite-difference time-domain schemes based on sub-audio-rate time variation of parameters. This allows for extensions of the behavior of physical models beyond the physically possible, broadening the range of expressive possibilities for the musician. The method is applied to the 1D wave equation, the stiff string, and 2D systems, including the 2D wave equation and thin plate. Results show that the method does not introduce noticeable artifacts when changing between grid configurations, including for systems with loss.
Authors: Choi, Woosung; Jeong, Yeong-Seok; Kim, Jinsung; Chung, Jaehwa; Jung, Soonyoung; Reiss, Joshua D.
Label-conditioned source separation extracts the target source, specified by an input symbol, from an input mixture track. A recently proposed label-conditioned source separation model called Latent Source Attentive Frequency Transformation (LaSAFT)-Gated Point-Wise Convolutional Modulation (GPoCM)-Net introduced a block for latent source analysis called LaSAFT. Employing LaSAFT blocks, it established state-of-the-art performance on several tasks of the MUSDB18 benchmark. This paper enhances the LaSAFT block by exploiting a self-conditioning method. Whereas the existing method only cares about the symbolic relationships between the target source symbol and latent sources, ignoring audio content, the new approach also considers audio content. The enhanced block computes the attention mask conditioning on the label and the input audio feature map. Here, it is shown that the conditioned U-Net employing the enhanced LaSAFT blocks outperforms the previous model. It is also shown that the present model performs the audio-query-based separation with a slight modification.
Authors: Darabundit, Champ C.; Abel, Jonathan S.; Berners, David
This article proposes a method suitable for discretizing continuous-time infinite impulse response filters with features near or above the Nyquist limit. The proposed method, called the Nyquist Band Transform (NBT), utilizes conformal mapping to pre-map a prototype continuous-time system, such that, when discretized through the bilinear transform, the discretized frequency response is effectively truncated at the Nyquist limit. The discretized system shows little frequency warping when compared with the original continuous-time magnitude response. The NBT is order-preserving, parametrizable, and agnostic to the original system's design. The efficacy of the NBT is demonstrated through a virtual analog modeling application.
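The frequency warping mentioned in this abstract can be illustrated with the standard bilinear-transform relation, in which an analog frequency f_a lands at the digital frequency f_d = (fs / pi) * arctan(pi * f_a / fs). The following is a minimal sketch of that textbook relation only; it is not the NBT itself, whose details are not given here, and the function name and the 48 kHz sample rate are assumptions chosen for illustration.

```python
import numpy as np

def bilinear_warped_frequency(f_analog_hz, fs_hz):
    """Digital frequency (Hz) at which an analog frequency lands after the
    plain bilinear transform (no pre-warping), for sample rate fs_hz."""
    f_analog_hz = np.asarray(f_analog_hz, dtype=float)
    return (fs_hz / np.pi) * np.arctan(np.pi * f_analog_hz / fs_hz)

fs = 48000.0  # assumed sample rate for the illustration
analog = np.array([1000.0, 10000.0, 20000.0, 40000.0])
digital = bilinear_warped_frequency(analog, fs)

# Low frequencies map almost unchanged, while features near or above the
# Nyquist limit (fs / 2) are squeezed below it.
for fa, fd in zip(analog, digital):
    print(f"analog {fa:8.0f} Hz -> digital {fd:8.1f} Hz")
```

Every analog frequency, however high, maps strictly below fs / 2, which is why features near or above the Nyquist limit appear compressed after a plain bilinear discretization.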
Authors: Bennett, Christopher; Hopman, Stefan
Antiderivative antialiasing (ADAA) has emerged as a recent approach to reduce aliases for mathematically defined nonlinearities. In this study, ADAA is applied to simplified nonlinear Volterra modeling, which is a method for blackbox modeling of Hammerstein nonlinearities. Previously reported ADAA approaches contain a variable difference term in the denominator and therefore rely on a continuous piecewise function to prevent very small denominators. However, when applied to simplified Volterra models, this denominator term is eliminated, resulting in a polynomial function. This polynomial ADAA was tested against the standard approach of low-pass filtering the input to prevent aliasing. It was found that these two approaches perform comparably but that by combining them together, superior alias reduction can be achieved.
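The cancellation of the denominator described above can be seen for a single power term. With the commonly used first-order ADAA form y[n] = (F(x[n]) - F(x[n-1])) / (x[n] - x[n-1]), where F is the antiderivative of the nonlinearity, choosing f(x) = x^k gives a difference of powers that factors exactly, leaving a polynomial in x[n] and x[n-1]. The sketch below assumes that first-order form and a single monomial; the function names, the `eps` threshold, and the midpoint fallback are illustrative choices, not the authors' implementation.

```python
import numpy as np

def adaa1_power_divided(x, k, eps=1e-9):
    """First-order ADAA for f(x) = x**k using the difference quotient.
    Falls back to f at the midpoint when the denominator is tiny."""
    x = np.asarray(x, dtype=float)
    x_prev = np.concatenate(([0.0], x[:-1]))  # assumes zero initial state
    diff = x - x_prev
    safe = np.abs(diff) > eps
    out = np.empty_like(x)
    antiderivative = lambda v: v ** (k + 1) / (k + 1)
    out[safe] = (antiderivative(x[safe]) - antiderivative(x_prev[safe])) / diff[safe]
    out[~safe] = (0.5 * (x[~safe] + x_prev[~safe])) ** k
    return out

def adaa1_power_polynomial(x, k):
    """Same quantity with the division cancelled analytically:
    (a**(k+1) - b**(k+1)) / (a - b) = sum_j a**(k-j) * b**j."""
    x = np.asarray(x, dtype=float)
    x_prev = np.concatenate(([0.0], x[:-1]))
    return sum(x ** (k - j) * x_prev ** j for j in range(k + 1)) / (k + 1)

# Compare the two forms on a 1 kHz sine sampled at 48 kHz (assumed values).
x = np.sin(2 * np.pi * 1000 * np.arange(480) / 48000)
print(np.max(np.abs(adaa1_power_divided(x, 3) - adaa1_power_polynomial(x, 3))))
```

The polynomial form needs no piecewise fallback because it never divides by the sample difference.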
Authors: Steinmetz, Christian J.; Bryan, Nicholas J.; Reiss, Joshua D.
This work presents a framework to impose the audio effects and production style from one recording to another by example with the goal of simplifying the audio production process. A deep neural network was trained to analyze an input recording and a style reference recording and predict the control parameters of audio effects used to render the output. In contrast to past work, this approach integrates audio effects as differentiable operators, enabling backpropagation through audio effects and end-to-end optimization with an audio-domain loss. Pairing this framework with a self-supervised training strategy enables automatic control of audio effects without the use of any labeled or paired training data. A survey of existing and new approaches for differentiable signal processing is presented, demonstrating how each can be integrated into the proposed framework along with a discussion of their trade-offs. The proposed approach is evaluated on both speech and music tasks, demonstrating generalization both to unseen recordings and even sample rates different than those during training. Convincing production style transfer results are demonstrated with the ability to transform input recordings to produced recordings, yielding audio effect control parameters that enable interpretability and user interaction.
Authors: Lindfors, Joel; Liski, Juho; Välimäki, Vesa
When a person listens to loudspeakers, the perceived sound is affected not only by the loudspeaker properties but also by the acoustics of the surroundings. Loudspeaker equalization can be used to correct the loudspeaker-room response. However, when the listener moves in front of the loudspeakers, both the loudspeaker response and room effect change. In order for the best correction to be achieved at all times, adaptive equalization is proposed in this paper. A loudspeaker-correction system using the listener's current location to determine the correction parameters is proposed. The position of the listener's head is located using a depth-sensing camera, and suitable equalizer settings are then selected based on measurements and interpolation. After correcting for the loudspeaker's response at multiple locations and changing the equalization in real time based on the user's location, a loudspeaker response with reduced coloration is achieved compared to no calibration or conventional calibration methods, with the magnitude-response deviations decreasing from 10.0 to 5.6 dB within the passband of a high-quality loudspeaker. The proposed method can improve the audio monitoring in music studios and other occasions in which a single listener is moving in a restricted space.
Authors: Cámara, Mateo; Blanco, José Luis
This paper analyzes the impact of signal phase handling in one of the most popular architectures for the generative synthesis of audio effects: variational autoencoders (VAEs). Until quite recently, autoencoders based on the Fast Fourier Transform routinely avoided the phase of the signal. They store the phase information and retrieve it at the output or rely on signal phase regenerators such as Griffin-Lim. We evaluate different VAE networks capable of generating a latent space with intrinsic information from signal amplitude and phase. The Modulated Complex Lapped Transform (MCLT) has been evaluated as an alternative to the Short-Time Fourier Transform (STFT). A novel database on beats has been designed for testing the architectures. Results were objectively assessed (reconstruction errors and objective metrics approximating opinion scores) with autoencoders on STFT and MCLT representations, using Griffin-Lim phase regeneration, multichannel networks, as well as the Complex VAE. The autoencoders successfully learned to represent the phase information and handle it in a holistic approach. State-of-the-art quality standards were reached for audio effects. The autoencoders show a remarkable ability to generalize and deliver new sounds, while overall quality depends on the reconstruction of phase and amplitude.
Authors: Cheshire, Matthew; Drysdale, Jake; Enderby, Sean; Tomczak, Maciej; Hockman, Jason
The ability to perceptually modify drum recording parameters in a post-recording process would be of great benefit to engineers limited by time or equipment. In this work, a data-driven approach to post-recording modification of the dampening and microphone positioning parameters commonly associated with snare drum capture is proposed. The system consists of a deep encoder that analyzes audio input and predicts optimal parameters of one or more third-party audio effects, which are then used to process the audio and produce the desired transformed output audio. Furthermore, two novel audio effects are specifically developed to take advantage of the multiple parameter learning abilities of the system. Perceptual quality of transformations is assessed through a subjective listening test, and an objective evaluation is used to measure system performance. Results demonstrate a capacity to emulate snare dampening; however, attempts were not successful in emulating microphone position changes.
Authors: Venkatesh, Satvik; Moffat, David; Miranda, Eduardo Reck
In recent years, machine learning has been widely adopted to automate the audio mixing process. Automatic mixing systems have been applied to various audio effects such as gain adjustment, equalization, and reverberation. These systems can be controlled through visual interfaces, audio examples, knobs, and semantic descriptors. Using semantic descriptors or textual information to control these systems is an effective way for artists to communicate their creative goals. In this paper, the novel idea of using word embeddings to represent semantic descriptors is explored. Word embeddings are generally obtained by training neural networks on large corpora of written text. These embeddings serve as the input layer of the neural network to create a translation from words to equalizer (EQ) settings. Using this technique, the machine learning model can also generate EQ settings for semantic descriptors that it has not seen before. The EQ settings chosen by humans are compared with the predictions of the neural network to evaluate the quality of predictions. The results showed that the embedding layer enables the neural network to understand semantic descriptors. It was observed that the models with embedding layers perform better than those without embedding layers but still not as well as human labels.
Authors: Elliott, Mitchell; Chon, Song Hui
Technological advances in music and audio engineering have brought forth a new method of audio mastering in the form of online automatic mastering services. However, many claim that the results from automatic mastering services are inferior to the work of professional human engineers. The presented investigation explores the perception of products mastered by popular mastering services found online. Music was submitted for mastering to two human mastering engineers and two automatic mastering services. In a listening test, subjects were asked to identify human-mastered samples. Later, subjects were asked to provide preference rankings among human-mastered and instant-mastered samples. Furthermore, objective parameters pertaining to timbre and spectral energy distribution were calculated from stimuli. Subjects were unable to consistently identify human-mastered musical samples. A preference for human-mastered samples was observed for jazz excerpts but not for rock excerpts. These results show partial support for claims of human mastering superiority based on preference. This study provides a new perspective on the perception of content from human and instant mastering, which may offer a first step to understanding many differences between the two services.
Authors: da Costa, Maurício do V. M.; Biscainho, Luiz W. P.
This paper describes a novel, low-cost method for combining time-frequency representations into a more sparse one. To this end, a new local quality measure that is based on an amplitude-weighted version of the so-called Hoyer sparsity is proposed. A detailed evaluation procedure that employs a dataset with nearly perfect f0 annotations of melodic signals and a set of white-noise pulses is adopted for assessing the time-frequency resolution attained. The proposed method is shown to produce state-of-the-art results among the existing combination methods in terms of energy concentration at frequency contours, onsets, and offsets, meeting the most desirable requirements: high time-frequency resolution, low computational cost, and the capability of combining representations with non-linear frequency scale.
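For orientation, the standard (unweighted) Hoyer sparsity of a vector x with N entries is (sqrt(N) - ||x||_1 / ||x||_2) / (sqrt(N) - 1), which equals 1 for a single nonzero entry and 0 for a flat vector. The sketch below computes only this plain global measure; the amplitude-weighted local variant proposed in the paper is not specified in the abstract and is not reproduced here. The function name and the zero return for degenerate inputs are assumptions.

```python
import numpy as np

def hoyer_sparsity(x):
    """Plain Hoyer sparsity in [0, 1]: 1 for a one-hot vector, 0 for a flat one.
    Returns 0.0 for all-zero or single-element inputs (an arbitrary convention)."""
    x = np.abs(np.asarray(x, dtype=float)).ravel()
    n = x.size
    l2 = np.sqrt(np.sum(x ** 2))
    if n < 2 or l2 == 0.0:
        return 0.0
    return float((np.sqrt(n) - x.sum() / l2) / (np.sqrt(n) - 1.0))

# A concentrated magnitude frame scores higher than a smeared one.
concentrated = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
smeared = np.array([0.2, 0.5, 1.0, 0.5, 0.2, 0.1, 0.1, 0.1])
print(hoyer_sparsity(concentrated), hoyer_sparsity(smeared))
```

A measure of this kind could be used to compare how concentrated the energy is in competing time-frequency representations of the same frame.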