AES E-Library

Evaluating the Impact of Speech Signal Restoration on ASR Performance: A Comparative Study of Speech-Transformer and Conformer Architectures in Polish Language Recognition

The study addresses automatic speech recognition (ASR) of Polish under conditions of impaired acoustic signal transmission. Two ASR model architectures were tested in an end-to-end (E2E) approach: Conformer (based on NVIDIA's STT Pl FastConformer Hybrid Transducer-CTC Large P&C model, adapted to Polish) and Speech-Transformer (based on the multilingual model available in the ESPnet toolkit). The Speech-Transformer architecture has convolution layers at the input, followed by Transformer blocks. In the Conformer architecture, the Transformer blocks are augmented with additional convolution layers, forming so-called Conformer blocks. The study used a dataset containing only Polish speech: the Mobile Corpus (EMU), with speech imitating a telephone conversation, with limited bandwidth and artifacts. For recognition of an impaired signal, two correction procedures stand out: augmenting the models' training data with impaired samples, and repairing the speech signal before feeding it to the E2E ASR model. This study used the second approach, improving the quality of speech samples with declipping and band restoration methods. Two tests were performed: feeding the speech signal to the E2E ASR model before and after the repair process. Model performance was assessed by Word Error Rate (WER) and Character Error Rate (CER). Restoring the speech signal produced an audible improvement in sound quality, but the results for the ASR models were inconclusive: WER decreased for the Speech-Transformer model but increased for the Conformer model. For a multilingual model such as the ESPnet toolkit model, speech recognition is divided into two stages: language identification and the actual speech recognition. In the first test (speech before repair), the multilingual model frequently confused Polish with Russian at the language identification stage. Improving signal quality significantly improved language identification for this model (a threefold increase in correct identification of Polish). The divergent results may indicate that the impact of signal restoration methods can be either positive or negative, depending on the parameters of the E2E ASR model.
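The abstract names WER and CER as the evaluation metrics but does not say how they were computed. The following is a minimal sketch of the standard definitions only, assuming both are the Levenshtein (edit) distance between reference and hypothesis divided by the reference length, counted in words for WER and in characters for CER. The function names and the sample sentences are hypothetical illustrations, not data or tooling from the paper, and text normalization (case, punctuation) is left out.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: fewest substitutions, deletions and insertions
    needed to turn `ref` into `hyp`."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference tokens into nothing costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance / reference length
    (spaces are counted as characters in this sketch)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical transcripts, made up for illustration only.
    reference = "ala ma kota"
    hypothesis = "ala ma psa"
    print(f"WER = {wer(reference, hypothesis):.3f}")
    print(f"CER = {cer(reference, hypothesis):.3f}")
```

In this made-up pair, one of three reference words is wrong, so WER is 1/3, while three character edits over eleven reference characters give a CER of 3/11. The two metrics can therefore move by different amounts for the same recognition error, which is one reason to report both.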

Author(s): Szymla, Julia; Pondel-Sycz, Karolina; Pietrzak, Agnieszka Paula
Affiliation: Warsaw University of Technology; Warsaw University of Technology; Warsaw University of Technology (See document for exact affiliation information.)
AES Convention: 156
Paper Number: 237
Publication Date: 2024-06-06
Permalink: https://aes2.org/publications/elibrary-page/?id=22583

This paper costs $33 for non-members and is free for AES members and E-Library subscribers.

Type: Express Paper
E-Library location: TMP/conv/156/
