The ability to generate high-quality synthetic speech with fine-grained control over intonation and emotion is crucial for broadcast applications such as narration and voice-over production. This paper presents a novel text-to-speech (TTS) workflow designed to generate expressive, user-controllable speech. By combining one-hot encoded emotion embeddings, built from application-aware, manually labeled emotion classes, with a fine-tuned neural model, our method allows precise control over speech tone and intention. We detail the data preparation pipeline, including manual annotation and automatic emotion classification, and describe the implementation of our speech synthesis model using VITS2. Additionally, we introduce an intuitive prosody control mechanism that enables word- and phoneme-level adjustments of pitch and duration. Subjective evaluation results indicate that our synthetic narration achieves parity with professionally recorded voice-overs, with some preference for synthetic speech due to its enhanced customization. This research contributes to the advancement of AI-driven voice production, enabling scalable, high-fidelity speech synthesis with fine-grained control for the media and entertainment industries.
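As a rough illustration of two ideas named in the abstract, the sketch below shows what a one-hot emotion vector and a per-phoneme duration adjustment could look like. It is not the authors' implementation: the emotion class names, the function names `one_hot` and `scale_durations`, and the duration values are all hypothetical, and the abstract does not specify the label set or how the controls feed into VITS2.

```python
# Hypothetical emotion classes; the paper's actual label set is not given in the abstract.
EMOTION_CLASSES = ["neutral", "happy", "sad", "emphatic"]


def one_hot(emotion: str, classes: list[str] = EMOTION_CLASSES) -> list[float]:
    """Return a vector with 1.0 at the index of `emotion` and 0.0 elsewhere."""
    if emotion not in classes:
        raise ValueError(f"unknown emotion class: {emotion!r}")
    return [1.0 if name == emotion else 0.0 for name in classes]


def scale_durations(durations: list[float], factors: dict[int, float]) -> list[float]:
    """Multiply selected per-phoneme durations (in seconds) by a user-chosen factor.

    `factors` maps a phoneme index to its multiplier; other phonemes are unchanged.
    """
    return [d * factors.get(i, 1.0) for i, d in enumerate(durations)]


if __name__ == "__main__":
    print(one_hot("sad"))
    # Stretch the second phoneme to twice its length (illustrative values).
    print(scale_durations([0.1, 0.2, 0.3], {1: 2.0}))
```

In a real system such a vector would typically be passed to the model as a conditioning input, and the duration factors would be applied to predicted durations before waveform generation; both steps are assumptions here rather than details taken from the paper.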
Author(s): Kruszielski, Luiz Fernando; Leite, Pedro H.L.; Fernandes, Myllene P.; Pereira, Andre; Biscainho, Luiz W. P.
Affiliation:
Globo Group; Globo Group; Globo Group / Federal University of Rio de Janeiro - UFRJ; Federal University of Rio de Janeiro - UFRJ / Polytech Grenoble
(See document for exact affiliation information.)
Publication Date:
2025-09-02
Session subject:
Artificial Intelligence and Machine Learning for Audio
Permalink: https://aes2.org/publications/elibrary-page/?id=23020
Kruszielski, Luiz Fernando; Leite, Pedro H.L.; Fernandes, Myllene P.; Pereira, Andre; Biscainho, Luiz W. P.; 2025; Broadcast-quality synthetic narration: a workflow for fine-grained text-to-speech intonation and emotion control [PDF]; Globo Group; Globo Group; Globo Group / Federal University of Rio de Janeiro - UFRJ; Federal University of Rio de Janeiro - UFRJ / Polytech Grenoble; Paper 31; Available from: https://aes2.org/publications/elibrary-page/?id=23020
@article{kruszielski2025broadcast-quality,
author={Kruszielski, Luiz Fernando and Leite, Pedro H.L. and Fernandes, Myllene P. and Pereira, Andre and Biscainho, Luiz W. P.},
journal={Journal of the Audio Engineering Society},
title={Broadcast-quality synthetic narration: a workflow for fine-grained text-to-speech intonation and emotion control},
year={2025},
number={31},
month={september},
}
TY  - paper
TI  - Broadcast-quality synthetic narration: a workflow for fine-grained text-to-speech intonation and emotion control
AU  - Kruszielski, Luiz Fernando
AU  - Leite, Pedro H.L.
AU  - Fernandes, Myllene P.
AU  - Pereira, Andre
AU  - Biscainho, Luiz W. P.
PY  - 2025
JO  - Journal of the Audio Engineering Society
VL  - 31
Y1  - September 2025