Journal of the Audio Engineering Society

2016 July/August - Volume 64 Number 7/8


Understanding how a human mix engineer works is necessary for the design of intelligent music production tools. The goal of such tools is to generate mixes that could realistically have been created by a human mix engineer. This paper presents an analysis of 1501 mixes, over 10 different songs, created by mix engineers. The number of mixes of each song ranged from 97 to 373. A variety of objective signal features were extracted, and principal component analysis revealed four dimensions of mix variation, which can be described as amplitude, brightness, bass, and width. The feature distributions suggest multimodal behavior dominated by one specific mode. The form of these distributions appears to be independent of the choice of song, although the modal parameters vary. The distributions are then used to obtain general trends and tolerance bounds for these features. The results presented here are useful as parametric guidance for intelligent music production systems and provide insight into the creative decision-making processes of mix engineers.
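The principal component analysis step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the three feature columns (loosely a level, a brightness, and a width measure), the four example mixes, and the use of power iteration to find only the first component are all assumptions made for illustration.

```python
import math

def standardize(rows):
    """Z-score each feature column across all mixes (rows)."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / (n - 1))
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def correlation_matrix(z):
    """Correlation matrix of already standardized rows."""
    n, d = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(matrix, iters=200):
    """Leading eigenvector and eigenvalue by power iteration."""
    d = len(matrix)
    v = [float(i + 1) for i in range(d)]  # arbitrary non-zero starting vector
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * matrix[i][j] * v[j]
                     for i in range(d) for j in range(d))
    return v, eigenvalue

# Hypothetical feature table: one row per mix, one column per feature.
mixes = [[1.0, 2.0, 0.0], [2.0, 4.0, 1.0], [3.0, 6.0, 0.0], [4.0, 8.0, 1.0]]
corr = correlation_matrix(standardize(mixes))
loadings, eigenvalue = first_component(corr)
explained = eigenvalue / len(corr)  # share of total variance on component 1
```

In this toy table the first two features are perfectly correlated, so they receive equal loadings and the first component carries most of the variance. A full analysis would extract further components, for example by deflation or with a linear algebra library.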

Organizing a sonic space through vocal imitations

Authors: Rocchesso, Davide; Mauro, Davide Andrea; Drioli, Carlo


This research investigates how the vocal mimicking capabilities of humans may be exploited to access and explore a given sonic space. Experiments showed that prototype vocal sounds can be represented in a two-dimensional space and still remain perceptually distinct from each other. The experiments provide a measure of how meaningful the machine distribution and grouping of vocal sounds are to humans, and confirm that humans are able to effectively use the acoustic and articulatory cues at their disposal to associate sounds with given prototypes. When used in an automatic clustering process, these cues are sufficiently consistent with those used by humans when categorizing acoustic phenomena. The procedure of dimensionality reduction and clustering is demonstrated in the case of imitations of engine sounds, which then represent the sonic space of a motor sound model. A two-dimensional space is particularly attractive for sound design because it can be used as a sonic map where the landmarks contain both a synthetic sound and its vocal imitation.

A soundscape recording captures the sonic environment at a given location at a given time using one or more fixed or moving microphones. In most cases, the soundscape is uncontrolled and unscripted. Human listeners experience sonic components as being either background or foreground depending on their salient perceptual characteristics, such as proximity, repetition, and spectral attributes. Analyzing soundscapes in research tasks requires the classification and segmentation of the important sonic components, but that process is time consuming when done manually. This research establishes the background and foreground classification task within a musicological and soundscape context and then presents a method for the automatic segmentation of soundscape recordings. Using a soundscape corpus with ground truth data obtained from a human perception study, the analysis shows that participants have a high level of agreement on the category assigned to background samples (92.5%), foreground samples (80.8%), and background with foreground samples (75.3%). Experiments demonstrate how smaller window sizes affect the performance of the classifier.

Multipath Beat Tracking

Authors: Giorgi, Bruno Di; Zanoni, Massimiliano; Böck, Sebastian; Sarti, Augusto

Most music compositions evolve according to an underlying unit of time, sometimes implied and sometimes audible, called the beat. Beat trackers are essential components of rhythmic analysis systems in a wide range of applications involving musical information extraction. The authors describe a new methodology for tracking the sequence of beat instants given the ODF (Onset Detection Function) and the IBI (Inter Beat Interval) estimate. This alternative solution is based on a divide-and-conquer strategy that concurrently manages a multitude of simpler trackers from a rule-based perspective, effectively performing the search only for the most promising beat candidates. As confirmed by the experimental results, the performance of this method improves when the number of simple trackers (paths) is increased. Compared to dynamic programming, this approach is preferable in terms of computational efficiency while yielding accuracy scores that are comparable with many known beat trackers.
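The idea of running several simple trackers over the ODF and keeping the most promising one can be sketched as follows. This is a loose illustration only, not the published algorithm: the snapping rule, the mean-strength score, the independent (rather than jointly managed) paths, and all function and parameter names are assumptions.

```python
def snap(odf, center, tol):
    """Index of the strongest ODF frame within +/- tol of center."""
    lo = max(center - tol, 0)
    hi = min(center + tol, len(odf) - 1)
    return max(range(lo, hi + 1), key=lambda i: odf[i])

def track_path(odf, start, ibi, tol):
    """One simple tracker: hop ahead by ibi frames and snap to a nearby peak."""
    beats = [start]
    while beats[-1] + ibi < len(odf):
        beats.append(snap(odf, beats[-1] + ibi, tol))
    score = sum(odf[b] for b in beats) / len(beats)  # mean onset strength
    return beats, score

def multipath_track(odf, ibi, tol=2, n_paths=4):
    """Run n_paths trackers from different starting phases; keep the best."""
    if not 0 <= tol < ibi:
        raise ValueError("tol must be smaller than ibi so paths move forward")
    starts = {snap(odf, (k * ibi) // n_paths, tol) for k in range(n_paths)}
    paths = [track_path(odf, s, ibi, tol) for s in sorted(starts)]
    beats, _ = max(paths, key=lambda p: p[1])
    return beats

# Hypothetical ODF: 40 frames with clear onsets every 10 frames from frame 3.
odf = [0.1] * 40
for i in (3, 13, 23, 33):
    odf[i] = 1.0
beats = multipath_track(odf, ibi=10, tol=2, n_paths=5)
```

With more starting phases, at least one path is more likely to lock onto the true beat phase, which is consistent with the abstract's observation that performance improves as the number of paths grows.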

An Intelligent Interface for Drum Pattern Variation and Comparative Evaluation of Algorithms

Authors: Vogl, Richard; Leimeister, Matthias; Nuanáin, Carthach Ó; Jordà, Sergi; Hlatky, Michael; Knees, Peter


Drum tracks are a central, style-defining element of electronic dance music, but creating them can be cumbersome because of a lack of appropriate tools and input devices. The authors created a tool that intuitively supports musicians in creating variations of drum patterns or finding inspiration for new patterns. Starting with a basic seed pattern provided by the user, the tool generates a list of variations with varying degrees of similarity to the seed. The variations are created using one of three algorithms: a similarity-based lookup method using a rhythm pattern database, a generative approach based on a stochastic neural network, and a genetic algorithm using similarity measures as the target function. Expert users in electronic music production evaluated aspects of the prototype and algorithms. In addition, a web-based survey was performed to assess perceptual properties of the variations in comparison to baseline patterns created by a human expert. The study shows that the algorithms produce musical and interesting variations and that the different algorithms have their strengths in different areas.
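Of the three algorithms, the similarity-based lookup is the simplest to illustrate. The sketch below assumes 16-step binary patterns for a single drum voice and a Hamming-style similarity; the abstract specifies neither the pattern encoding nor the similarity measure, so both are stand-ins, as is the tiny pattern database.

```python
def similarity(a, b):
    """Fraction of steps on which two equal-length binary patterns agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def variations(seed, database, count=3):
    """Database patterns ranked by similarity to the seed, most similar first.

    Exact copies of the seed are skipped because they are not variations.
    """
    candidates = [p for p in database if p != seed]
    candidates.sort(key=lambda p: similarity(seed, p), reverse=True)
    return candidates[:count]

# Hypothetical seed: a four-on-the-floor kick pattern over 16 steps.
seed = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
pattern_a = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0]
pattern_b = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0]
pattern_c = [1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0]
database = [pattern_c, list(seed), pattern_b, pattern_a]

ranked = variations(seed, database)
```

Ranking by similarity gives the user a list that moves gradually away from the seed, matching the "varying degrees of similarity" described above.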

Improving Multilingual Interaction for Consumer Robots through Signal Enhancement in Multichannel Speech

Authors: Tsardoulias, Emmanouil; Thallas, Aristeidis G.; Symeonidis, Andreas L.; Mitkas, Pericles A.

For social robots to be truly successful, they need the ability to communicate orally with humans, providing feedback and accepting commands. Social robots need automatic speech recognition (ASR) tools that function with different users, using different languages, voice pitches, pronunciations, and speech speeds over a wide range of sound and noise levels. This paper describes different methodologies for voice activity detection and noise elimination when used with ASR-based oral interaction on an affordable, low-budget robot. Acoustically quasi-stationary environments are assumed, which, in conjunction with the high background noise of the robot’s microphones, make ASR challenging. This work has been performed in the context of the RAPP project, which attempts to deliver a cloud repository of applications and services that can be utilized by heterogeneous robots, aiming at assisting people with a range of disabilities. Results show that noise estimation and elimination techniques are necessary for successfully performing ASR in environments with quasi-stationary noise.
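A very simple form of voice activity detection under quasi-stationary noise can be sketched as an energy threshold relative to an estimated noise floor. This is a generic textbook approach, not one of the methodologies evaluated in the paper; the frame length, the threshold factor, and the assumption that the recording begins with noise-only frames are all illustrative choices.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of consecutive non-overlapping frames."""
    return [sum(s * s for s in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def detect_voice(samples, frame_len=4, noise_frames=2, factor=4.0):
    """Flag frames whose energy exceeds `factor` times the noise floor.

    The noise floor is estimated from the first `noise_frames` frames,
    which are assumed (hypothetically) to contain background noise only.
    """
    energies = frame_energies(samples, frame_len)
    noise_floor = sum(energies[:noise_frames]) / noise_frames
    return [e > factor * noise_floor for e in energies]

# Hypothetical signal: quiet noise, a louder burst standing in for speech,
# then quiet noise again.
noise = [0.01, -0.01, 0.01, -0.01]
speech = [0.5, -0.5, 0.5, -0.5]
samples = noise + noise + speech + speech + noise
flags = detect_voice(samples)
```

Because the noise is assumed to be quasi-stationary, a single floor estimated at the start remains a usable reference for the rest of the signal.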

On the Impact of The Semantic Content of Sound Events in Emotion Elicitation

Authors: Drossos, Konstantinos; Kaliakatsos-Papakostas, Maximos; Floros, Andreas; Virtanen, Tuomas


Sound events are known to have an influence on the listener’s emotions, but the reason for this influence is less clear. Take, for example, the sound produced by a gun firing. Does the emotional impact arise from the fact that the listener recognizes that a gun produced the sound (semantic content), or does it arise from the attributes of the sound created by the firing gun? This research explores the relation between the semantic similarity of the sound events and the elicited emotions. Results indicate that the semantic content seems to have a limited role in the formation of the listener’s affective states. However, when the semantic content is matched to specific areas in the Arousal-Valence space or when the source’s spatial position is considered, the effect of the semantic content is higher, especially for the cases of medium to low valence and medium to high arousal or when the sound source is at the lateral positions of the listener’s head.

Existing methods to index and search audio documents are generally based on text metadata and text-based search engines, but this approach is often problematic and time-consuming because the text label does not necessarily describe the audio content. Query by Example (QBE) is an alternative approach for improving the effectiveness and efficiency of sound retrieval. In this research, the authors propose a novel approach for sound query by vocal imitation. Vocal imitation is commonly used in human communication and can be employed for human-computer interaction. Two systems are proposed: (1) a supervised system that trains a multiclass classifier using training vocal imitations of different sound classes in the library and classifies a new imitation query into one of the classes; (2) an unsupervised system that is more flexible because it measures the feature distance between the imitation query and each sound in the library, returning sounds most similar to the query. Such systems require an effective feature representation of imitation queries and sounds in the library. Existing handcrafted audio features may not work well given the variety of vocal imitations and the mismatch between vocal imitations and actual sounds. It is therefore proposed to learn feature representations from training vocal imitations automatically using a Stacked Auto-Encoder (SAE). Experiments show that retrieval with automatically learned features outperforms retrieval with carefully handcrafted features in both supervised and unsupervised settings.
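The unsupervised system's retrieval step, ranking library sounds by feature distance to the query, can be sketched as below. The feature vectors here are hand-written placeholders rather than SAE-learned features, and Euclidean distance is an assumption, since the abstract only says "feature distance"; the sound names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve(query, library, k=3):
    """Names of the k library sounds whose features lie closest to the query."""
    ranked = sorted(library, key=lambda name: euclidean(query, library[name]))
    return ranked[:k]

# Hypothetical 3-dimensional features; a real system would use learned ones.
library = {
    "dog_bark": [0.9, 0.1, 0.3],
    "car_engine": [0.2, 0.8, 0.5],
    "door_slam": [0.7, 0.2, 0.9],
    "motorbike": [0.3, 0.9, 0.4],
}
query = [0.22, 0.82, 0.5]  # features of a vocal imitation of an engine
results = retrieve(query, library, k=2)
```

Because no classifier is trained, new sounds can be added to the library without retraining, which is the flexibility the abstract attributes to the unsupervised system.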


[Feature] As immersive audio systems and production techniques gain greater prominence in the market, the need for cost-effective solutions becomes apparent. Existing systems need to be adapted to enable object-based production techniques. “Ideal” reproduction solutions are having to be rationalized for practical purposes. Headphones offer one possible destination for immersive content, without excessive hardware requirements.

140th Convention Report, Paris

140th Convention Exhibitors and Sponsors

141st Convention Preview, Los Angeles

141st Convention Exhibitor and Sponsor Preview

2016 Conference on Audio for Virtual and Augmented Reality Preview, Los Angeles

140th Convention Paper Abstracts

AES Bylaws

Book Review

Section News

AES Conventions and Conferences

Table of Contents

Cover & Sustaining Members List

AES Officers, Committees, Offices & Journal Staff
