The study addresses the problem of automatic speech recognition (ASR) of Polish under conditions of impaired acoustic signal transmission. Two ASR model architectures were tested in an end-to-end (E2E) approach: Conformer (based on the STT Pl FastConformer Hybrid Transducer-CTC Large P&C (NVIDIA) model adapted to Polish) and Speech-Transformer (based on the multilingual model available in the ESPnet toolkit). The Speech-Transformer architecture contains convolution layers at the input, followed by Transformer blocks. In the Conformer architecture, the Transformer blocks are augmented with additional convolution layers, forming so-called Conformer blocks. The study used a Polish-only dataset, the Mobile Corpus (EMU), whose speech imitates a telephone conversation with limited bandwidth and artifacts. In speech recognition for an impaired signal, two correction procedures stand out: augmenting the model's training data with impaired samples, and repairing the speech signal before feeding it to the input of the E2E ASR model. This study used the second approach, improving the quality of speech samples using declipping and band restoration methods. The examination was performed in two tests: feeding a speech signal to the input of the E2E ASR model before and after the repair process. Model performance was assessed by Word Error Rate (WER) and Character Error Rate (CER). Restoring the speech signal resulted in an audible improvement in sound quality, but the results for the ASR models were inconclusive: WER values decreased for the Speech-Transformer model but increased for the Conformer model. For a multilingual model (such as the ESPnet toolkit model), the speech recognition process is divided into two stages: language identification and the actual speech recognition. In the first test (speech before repair), the multilingual model frequently confused Polish with Russian at the first of these stages.
Improving signal quality significantly improved the model's identification of the correct language (a threefold increase in correct identification of Polish). The divergent results may indicate that the impact of methods used to improve signal quality can be either positive or negative, depending on the parameters of the E2E ASR model.
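For readers unfamiliar with the two metrics named in the abstract, the following is a minimal sketch of how WER and CER are commonly computed: the Levenshtein (edit) distance between reference and hypothesis, divided by the reference length, counted in words for WER and in characters for CER. The function names and the example sentences are illustrative assumptions; the abstract does not state which scoring tool or text normalization the authors used.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (minimum number of substitutions, insertions, and deletions)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference items into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length.
    Spaces are counted as characters here; this is an assumption, since
    scoring tools differ on whitespace handling."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical transcripts, not taken from the paper's data.
example_wer = wer("ala ma kota", "ala ma psa")
example_cer = cer("ala ma kota", "ala ma psa")
```

In this example one of three reference words is wrong, so the WER is 1/3. Because both metrics divide by the reference length, either can exceed 1.0 when the hypothesis contains many insertions.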
Author(s): Szymla, Julia; Pondel-Sycz, Karolina; Pietrzak, Agnieszka Paula
Affiliation: Warsaw University of Technology; Warsaw University of Technology; Warsaw University of Technology
(See document for exact affiliation information.)
AES Convention: 156
Paper Number: 237
Publication Date: 2024-06-06
Permalink: https://aes2.org/publications/elibrary-page/?id=22583
Szymla, Julia; Pondel-Sycz, Karolina; Pietrzak, Agnieszka Paula; 2024; Evaluating the Impact of Speech Signal Restoration on ASR Performance: A Comparative Study of Speech-Transformer and Conformer Architectures in Polish Language Recognition [PDF]; Warsaw University of Technology; Warsaw University of Technology; Warsaw University of Technology; Paper 237; Available from: https://aes2.org/publications/elibrary-page/?id=22583
Szymla, Julia; Pondel-Sycz, Karolina; Pietrzak, Agnieszka Paula; Evaluating the Impact of Speech Signal Restoration on ASR Performance: A Comparative Study of Speech-Transformer and Conformer Architectures in Polish Language Recognition [PDF]; Warsaw University of Technology; Warsaw University of Technology; Warsaw University of Technology; Paper 237; 2024 Available: https://aes2.org/publications/elibrary-page/?id=22583
@article{szymla2024evaluating,
author={Szymla, Julia and Pondel-Sycz, Karolina and Pietrzak, Agnieszka Paula},
journal={Journal of the Audio Engineering Society},
title={Evaluating the Impact of Speech Signal Restoration on {ASR} Performance: A Comparative Study of {Speech-Transformer} and {Conformer} Architectures in {Polish} Language Recognition},
year={2024},
number={237},
month={may},
}
TY – paper
TI – Evaluating the Impact of Speech Signal Restoration on ASR Performance: A Comparative Study of Speech-Transformer and Conformer Architectures in Polish Language Recognition
AU – Szymla, Julia
AU – Pondel-Sycz, Karolina
AU – Pietrzak, Agnieszka Paula
PY – 2024
JO – Journal of the Audio Engineering Society
VL – 237
Y1 – May 2024