We present a novel neural encoder system for acoustic-to-articulatory inversion. We leverage the Pink Trombone voice synthesizer, which exposes its articulatory parameters (e.g., tongue position and vocal cord configuration). Our system is designed to identify the articulatory features responsible for producing specific acoustic characteristics contained in a neural latent representation. To generate the necessary latent embeddings, we employed two main methodologies. The first was a self-supervised variational autoencoder trained from scratch to reconstruct the input signal at the decoder stage. We conditioned its bottleneck layer with a subnetwork called the "projector," which decodes the voice synthesizer's parameters.
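The first methodology described above can be sketched in a few lines of PyTorch: a variational autoencoder whose latent bottleneck feeds both a reconstruction decoder and a "projector" head that regresses synthesizer parameters. This is a minimal illustrative sketch, not the authors' implementation; the layer sizes, the use of spectral frames as input, and the mapping of six parameters onto the [0, 1] range are all assumptions.

```python
import torch
import torch.nn as nn

class VAEWithProjector(nn.Module):
    """Sketch: a VAE whose latent code is also decoded into articulatory parameters."""
    def __init__(self, n_mels=80, latent_dim=32, n_params=6):
        super().__init__()
        # Encoder outputs mean and log-variance of the latent distribution
        self.encoder = nn.Sequential(
            nn.Linear(n_mels, 128), nn.ReLU(), nn.Linear(128, 2 * latent_dim)
        )
        # Decoder reconstructs the input signal (self-supervised objective)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, n_mels)
        )
        # "Projector": conditions the bottleneck by decoding synthesizer controls
        self.projector = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, n_params), nn.Sigmoid()  # controls assumed in [0, 1]
        )

    def forward(self, x):
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        # Reparameterization trick: sample z while keeping gradients
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), self.projector(z), mu, logvar

model = VAEWithProjector()
x = torch.randn(4, 80)  # batch of 4 spectral frames (hypothetical input shape)
recon, params, mu, logvar = model(x)
```

Training would combine a reconstruction loss, a KL term on (mu, logvar), and a parameter-regression loss on the projector's output.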
The second methodology utilized two pretrained models: EnCodec and Wav2Vec. They eliminate the need to train the encoding process from scratch, allowing us to focus on training the projector network. This approach aimed to explore the potential of these existing models in the context of acoustic-to-articulatory inversion. By reusing the pretrained models, we significantly simplified the data processing pipeline, increasing efficiency and reducing computational overhead.
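The second methodology keeps the pretrained encoder frozen and trains only the projector on its embeddings. The sketch below uses a placeholder module standing in for EnCodec or Wav2Vec (loading the real checkpoints is omitted); the embedding dimension, waveform length, and optimizer settings are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Stand-in for a frozen pretrained encoder (e.g., EnCodec or Wav2Vec);
# in practice you would load the real model and freeze its weights.
class FrozenEncoder(nn.Module):
    def __init__(self, embed_dim=768):
        super().__init__()
        self.net = nn.Linear(16000, embed_dim)  # placeholder feature map
    def forward(self, wav):
        with torch.no_grad():  # no gradients through the pretrained encoder
            return self.net(wav)

encoder = FrozenEncoder()
for p in encoder.parameters():
    p.requires_grad = False

# Only this small head is trained: embeddings -> 6 synthesizer parameters
projector = nn.Sequential(
    nn.Linear(768, 64), nn.ReLU(), nn.Linear(64, 6), nn.Sigmoid()
)
opt = torch.optim.Adam(projector.parameters(), lr=1e-3)

wav = torch.randn(2, 16000)   # 1 s of audio at 16 kHz, batch of 2 (assumed)
target = torch.rand(2, 6)     # ground-truth synthesizer parameters
loss = nn.functional.mse_loss(projector(encoder(wav)), target)
loss.backward()
opt.step()
```

Because only the projector's weights receive gradients, training is far cheaper than learning the encoding from scratch, which is the efficiency gain the abstract describes.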
The primary goal of our project was to demonstrate that these neural architectures can effectively encapsulate both acoustic and articulatory features. This prediction-based approach is much faster than traditional methods based on optimizing parameters against acoustic features. We validated our models by predicting six different parameters and evaluating them with objective metrics and the ViSQOL perceptual metric, using both synthesizer- and human-generated sounds. The results show that the predicted parameters can generate human-like vowel sounds when input into the synthesizer. We provide the dataset, code, and detailed findings to support future research in this field.
Authors: Cámara, Mateo; Marcos, Fernando; Blanco, Jose Luis
Affiliation: Grupo de Aplicaciones del Procesado de Señal, Universidad Politécnica de Madrid, Spain; Information Processing and Telecommunications Center, Universidad Politécnica de Madrid, Spain
(See document for exact affiliation information.)
AES Convention: 156
Paper Number: 251
Publication Date: 2024-06-06
Permalink: https://aes2.org/publications/elibrary-page/?id=22596
Citation: Cámara, Mateo; Marcos, Fernando; Blanco, Jose Luis (2024). "Decoding Vocal Articulations from acoustic latent representations." AES Convention 156, Paper 251. Available: https://aes2.org/publications/elibrary-page/?id=22596