AES E-Library

Evaluating the Impact of Speech Signal Restoration on ASR Performance: A Comparative Study of Speech-Transformer and Conformer Architectures in Polish Language Recognition

The study addresses automatic speech recognition (ASR) of Polish under conditions of impaired acoustic signal transmission. It tested two end-to-end (E2E) ASR model architectures: Conformer (based on the NVIDIA STT Pl FastConformer Hybrid Transducer-CTC Large P&C model, adapted to Polish) and Speech-Transformer (based on the multilingual model available in the ESPnet toolkit). The Speech-Transformer architecture has convolution layers at the input, followed by Transformer blocks. In the Conformer architecture, the Transformer blocks are augmented with additional convolution layers, forming so-called Conformer blocks. The study used a Polish-only dataset, the Mobile Corpus (EMU), whose speech imitates a telephone conversation with limited bandwidth and artifacts. For recognition of an impaired signal, two correction procedures stand out: augmenting the model's training data with impaired samples, and repairing the speech signal before feeding it to the E2E ASR model. This study used the second approach, improving the quality of the speech samples with declipping and band restoration methods. The examination comprised two tests: feeding the speech signal to the E2E ASR model before and after the repair process. Model performance was assessed by Word Error Rate (WER) and Character Error Rate (CER). Restoring the speech signal produced an audible improvement in sound quality, but the results for the ASR models were inconclusive: WER decreased for the Speech-Transformer model but increased for the Conformer model. For a multilingual model such as the ESPnet toolkit model, speech recognition is divided into two stages: language identification and the actual speech recognition. In the first test (speech before repair), the multilingual model frequently confused Polish with Russian at the language identification stage. Improving signal quality markedly improved the model's language identification (a threefold increase in correct identification of Polish). The divergent results may indicate that the impact of signal quality improvement methods can be either positive or negative, depending on the parameters of the E2E ASR model.
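The two metrics named in the abstract can be illustrated with a minimal sketch. It assumes the common definition of WER and CER as the Levenshtein (edit) distance between reference and hypothesis, divided by the reference length in words or characters. The abstract does not say which scoring tool or text normalization the authors used, so the function names and the example transcripts below are hypothetical and are not data from the study.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical transcripts, invented for illustration only.
reference = "dzień dobry państwu"
hyp_before_repair = "dzień dobra państwo"   # made-up ASR output on the impaired signal
hyp_after_repair = "dzień dobry państwo"    # made-up ASR output on the repaired signal

for label, hyp in [("before repair", hyp_before_repair),
                   ("after repair", hyp_after_repair)]:
    print(f"{label}: WER={wer(reference, hyp):.3f}  CER={cer(reference, hyp):.3f}")
```

Scoring the same reference against the outputs obtained before and after repair mirrors the two-test comparison described above. In practice the transcripts would usually be normalized first (case, punctuation), which this sketch does not do.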

 

Permalink: https://aes2.org/publications/elibrary-page/?id=22583





E-Library location: 16938