Journal of the Audio Engineering Society

2016 September - Volume 64 Number 9


Semantic Browsing of Sound Databases without Keywords

Authors: Lafay, Grégoire; Misdariis, Nicolas; Lagrange, Mathieu; Rossignol, Mathias

With the growing capability of recording and storage devices, the problem of indexing large audio databases has attracted much attention. Most of this effort is dedicated to automatic inference from indexed metadata. In contrast, browsing audio databases effectively has received less consideration. This report studies the relevance of a semantic organization of sounds for easing the browsing of a sound database. For such a task, semantic access to data is traditionally implemented by a keyword selection process. However, various limitations of written language, such as word polysemy, ambiguity, or translation issues, may bias the browsing process. The authors propose two sound presentation strategies that organize sounds spatially to reflect an underlying semantic hierarchy. For comparison, they also considered a display whose spatial organization was based only on acoustic cues. These three displays were evaluated in terms of search speed in a crowdsourcing experiment using two different corpora: environmental sounds from urban environments and sounds produced by musical instruments. Coherent results demonstrate the usefulness of an implicit semantic organization for representing sounds, in terms of both search speed and learning efficiency.

This study describes a listening experiment designed to further examine the previously proposed luminance-texture-mass (LTM) model of timbral semantics. Thirty-two musically trained listeners rated twenty-four instrument tones on six predefined semantic scales: brilliance, depth, roundness, warmth, fullness, and richness. These six scales were analyzed with Principal Component Analysis (PCA) and Multidimensional Scaling (MDS) to produce two different timbre spaces. These timbre spaces were subsequently compared for their configurational and dimensional similarity with the LTM semantic space and with the direct MDS perceptual space obtained from the same stimuli. The results showed that the selected semantic scales adequately represent the LTM model and perform fairly well at predicting the configurations of sounds that result from pairwise dissimilarity ratings.
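The PCA step described above reduces a matrix of mean semantic-scale ratings to a low-dimensional timbre space. A minimal sketch of that step, with invented toy ratings (the function name, matrix values, and number of scales are illustrative assumptions, not data from the study):

```python
import numpy as np

def pca_timbre_space(ratings, n_dims=2):
    """Project tones into a low-dimensional semantic timbre space.

    ratings: (n_tones, n_scales) matrix of mean semantic-scale
    ratings (e.g., brilliance, depth, roundness, ...), one row per
    instrument tone. Returns the (n_tones, n_dims) PCA scores.
    """
    centered = ratings - ratings.mean(axis=0)          # center each scale
    # SVD of the centered data yields the principal axes in vt
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_dims].T                    # scores on top components

# Toy example: 4 tones rated on 3 semantic scales
ratings = np.array([[6.0, 2.0, 3.0],
                    [5.5, 2.5, 3.5],
                    [1.0, 6.0, 5.0],
                    [1.5, 5.5, 4.5]])
space = pca_timbre_space(ratings)
```

Tones with similar rating profiles (the first two rows) end up close together in the projected space, which is the property the configurational-similarity comparisons rely on.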

Automatic Soundscape Affect Recognition Using a Dimensional Approach

Authors: Fan, Jianyu; Thorogood, Miles; Pasquier, Philippe

Soundscape studies have demonstrated a variety of approaches for investigating how soundscape affect is part of immersive experiences. This research aims to develop an automatic affect recognition system that soundscape composers can use to create emotional compositions that evoke a target mood. In addition, the system can offer sound designers a more streamlined workflow for creating suitable sound effects for films and can offer engineers a way to design mood-enabled recommendation systems for the retrieval of soundscape recordings. The research uses ground truth data collected from an online survey, and an analysis of the corpus shows that participants have a high level of agreement on the valence and arousal of soundscapes. The authors then generated a gold standard by averaging user responses. The proposed system obtained better results than an expert-user model.
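The gold-standard step mentioned above is a simple per-clip average of participants' valence/arousal ratings. A minimal sketch, assuming a rating scale of roughly -1 to 1 and invented clip names and values:

```python
import statistics

def gold_standard(annotations):
    """Average per-clip (valence, arousal) ratings into a gold standard.

    annotations: mapping clip_id -> list of (valence, arousal) tuples,
    one per participant. Returns clip_id -> (mean_valence, mean_arousal).
    """
    return {
        clip: (statistics.mean(v for v, _ in ratings),
               statistics.mean(a for _, a in ratings))
        for clip, ratings in annotations.items()
    }

# Invented example ratings from two participants per clip
responses = {"park":    [(0.6, 0.2), (0.8, 0.4)],
             "traffic": [(-0.5, 0.7), (-0.3, 0.9)]}
gs = gold_standard(responses)
```

Averaging is only meaningful because, as the abstract notes, inter-participant agreement on valence and arousal was high.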

Real-time control of the emotional content of sound has utility in video game soundtracking, where the player controls the narrative trajectory and the affective attributes of the sound should ideally match it. Perceived emotions can be represented in a two-dimensional space composed of valence (positivity, e.g., happy, sad, fearful) and arousal (intensity, e.g., mild vs. strong). This report is a speculative exploration of measuring and manipulating sound effects to achieve emotional congruence. An initial study suggests that timbral features can influence the perceived emotional response of a listener. A panel of listeners responded to a set of stimuli that varied in timbre while pitch, loudness, and other musical and acoustic features (key, melodic contour, rhythm, meter, reverberant environment, etc.) were held constant. The long-term goal is to create an automated system that uses real-time timbre morphing to manipulate perceived affect in soundtrack generation.
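The valence/arousal plane described above lets discrete emotion labels be treated as points, so a measured response can be labeled by its nearest reference emotion. A minimal sketch; the coordinate values are invented for illustration and are not taken from the study:

```python
import math

# Illustrative reference points in a valence/arousal plane
# (assumed coordinates, for demonstration only).
EMOTIONS = {"happy":   (0.8, 0.6),
            "sad":     (-0.7, -0.4),
            "fearful": (-0.6, 0.7),
            "calm":    (0.4, -0.6)}

def nearest_emotion(valence, arousal):
    """Label a measured (valence, arousal) point with the closest
    reference emotion by Euclidean distance."""
    return min(EMOTIONS,
               key=lambda e: math.dist((valence, arousal), EMOTIONS[e]))
```

For example, a high-valence, moderately high-arousal measurement falls nearest the "happy" reference point, while a negative-valence, high-arousal one falls nearest "fearful".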

Automatic music transcription transforms an acoustic music signal into a symbolic notation; it typically involves the detection of multiple concurrent pitches, the detection of note onsets and offsets, and the recognition of the instruments. This paper presents a novel method for transcribing folk music. In contrast to most commercial music, folk music recordings may contain various inaccuracies because they are usually performed by amateur musicians and recorded in the field. The proposed method fuses three sources of information: frame-based multiple-F0 estimates, song structure, and pitch drift estimates. Using song structure can improve transcription accuracy. The method uses two strategies: exploiting repetitions aligned in the time and pitch domains to improve F0 estimates, and incorporating a probabilistic model based on explicit-duration hidden Markov models (EDHMM) to estimate notes from F0. A representative segment of the analyzed song is used to align the other segments. Information from these segments is summarized and used in a two-layer probabilistic EDHMM to segment frame-based information into notes.
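One way to picture the repetition-fusion strategy above is a per-frame median over time-aligned F0 tracks from repeated segments, which suppresses octave errors and spurious estimates that occur in only one repetition. This is a hedged sketch of the general idea, not the paper's actual algorithm; the function name and values are invented:

```python
import statistics

def fuse_repetitions(f0_tracks):
    """Fuse frame-wise F0 estimates (Hz) from repeated, time-aligned
    segments by taking the per-frame median. Frames with no estimate
    are given as None and ignored."""
    fused = []
    for frame in zip(*f0_tracks):                  # iterate frame by frame
        values = [f for f in frame if f is not None]
        fused.append(statistics.median(values) if values else None)
    return fused

# Three aligned repetitions; the 880 Hz value is an octave error
fused = fuse_repetitions([[220.0, 440.0, None],
                          [220.0, 442.0, 330.0],
                          [221.0, 880.0, 331.0]])
```

In the second frame the octave error (880 Hz) is outvoted by the two consistent estimates near 440 Hz.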

From Interactive to Adaptive Mood-Based Music Listening Experiences in Social or Personal Contexts

Authors: Barthet, Mathieu; Fazekas, György; Allik, Alo; Thalmann, Florian; Sandler, Mark B.


Listeners of audio are increasingly shifting to a participatory culture in which technology allows them to modify and control the listening experience. This report describes the development of a mood-driven music player, Moodplay, which incorporates semantic computing technologies for musical mood based on social tags, along with informative and aesthetic browsing visualizations. The prototype runs with a dataset of over 10,000 songs covering various genres, arousal levels, and valence levels. Changes in the design of the system were made in response to user evaluations from over 120 participants in 15 different sectors of work or education. The proposed client/server architecture integrates modular components powered by semantic web technologies and audio content feature extraction. This enables recorded music content to be controlled in flexible and nonlinear ways. Dynamic music objects can be used to create on-the-fly mashups of two or more simultaneous songs, allowing the selection of multiple moods. The authors also consider nonlinear audio techniques that could transform the player into a creative tool, for instance by temporally reorganizing, compressing, or expanding prerecorded content.

Engineering Reports

Audealize: Crowdsourced Audio Production Tools

Authors: Seetharaman, Prem; Pardo, Bryan

While professional audio production tools such as reverberators and equalizers are widely available to musicians who are not expert audio engineers, their interfaces can be frustrating. Professional interfaces are parameterized in terms of low-level signal manipulations that are not intuitive to nonexperts. This report describes Audealize, an interface that bridges the gap between the low-level parameters of existing audio production tools and programmatic goals, such as "make my guitar sound underwater." Users modify the audio by selecting descriptive terms in a word-map built from a crowdsourced vocabulary of word labels for audio effects. A study with 432 nonexperts found that they favored the crowdsourced word-map over traditional interfaces. Absolute performance measures showed that those who used the word-map interface produced results that were equal to or better than those from traditional interfaces, and participants preferred the word-map interface on the word-matching task. The effectiveness of the interface was surprising because it was not designed for this task: one would expect the fine control afforded by the signal-parameter interface to let the user match the effect more closely than the word-map would. This indicates that a crowdsourced language is an effective interaction paradigm for novice users of audio production tools.
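The core of the word-map idea above is a mapping from crowdsourced descriptive words to concrete effect settings, so that choosing a word recalls the parameters it describes. A minimal sketch; the words, parameter names, and values below are invented for illustration and do not reflect Audealize's actual vocabulary or parameter spaces:

```python
# Hypothetical crowdsourced vocabulary: each descriptive word maps to
# the effect settings it was most strongly associated with by the crowd.
WORD_MAP = {
    "underwater": {"eq_low_gain_db": 6.0, "eq_high_gain_db": -12.0},
    "bright":     {"eq_low_gain_db": -3.0, "eq_high_gain_db": 6.0},
    "cavernous":  {"reverb_decay_s": 4.5, "reverb_wet": 0.6},
}

def settings_for(word):
    """Return the effect parameters associated with a descriptive word."""
    try:
        return WORD_MAP[word]
    except KeyError:
        raise ValueError(f"no crowdsourced setting for {word!r}")
```

The user-facing gain is that a novice picks "underwater" rather than reasoning about shelving-filter gains, which is precisely the low-level/high-level gap the report describes.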


[Feature] The design of car audio systems involves an understanding of the challenging acoustics of the cabin. Systems can be “tuned” for specific listener positions. Compromises may need to be made depending on aspects of car design and production management. It is also possible to engineer enhanced spatial listening experiences in cars, and upmixing makes it possible to generate suitable signals for surround and vertical loudspeakers.

2016 Sound Field Control Conference Report, Guildford

142nd Call for Papers and Engineering Briefs, Berlin


AES Conventions and Conferences


Table of Contents

Cover & Sustaining Members List

AES Officers, Committees, Offices & Journal Staff
