AES E-Library

Handling real-world challenges of variable speech quality and multiple speakers in forensic automatic speaker recognition using VOCALISE

Recordings submitted for forensic speaker comparison typically present several complexities that must be taken into account before analysis. These can involve the quality of the recording, the duration and linguistic diversity of the speech of the speaker of interest, and the presence of other speakers in the recording. One of the first tasks the forensic analyst has to perform is to isolate a sample of speech from only the speaker of interest and to make sure it is of adequate quantity and quality to create a voice model for automatic comparison. This task is not straightforward and requires experience and skill from the practitioner to create a reliable model. Traditionally, automatic speaker comparison implicitly assumes that the input data contains only one speaker per recording, or has been preprocessed to contain only one speaker. However, depending on the nature of the case, the number of speakers may vary between recordings, and forensic practitioners may not have the capacity to preprocess many files, each potentially containing multiple unknown speakers.
The quality of a recording depends on both the quality of the acoustic signal and the quantity and diversity of the linguistic content it contains. In determining whether a file can be used for forensic analysis, the practitioner frequently has to rely on their own, often subjective, judgment of the quality of the recording. For instance, they may decide that a recording is very noisy and contains very little speech, and is therefore unlikely to be useful for analysis. Providing the practitioner with objective quality metrics would help them decide whether to proceed with the analysis and would indicate what the potential error rates might be.
In this article, using the latest version of VOCALISE (Voice Comparison and Analysis of the Likelihood of Speech Evidence), a forensic automatic speaker recognition system, we present pragmatic solutions to some of these common case-related issues faced by forensic practitioners. For the objective analysis of the acoustic quality of different recordings, we consider a star-rating audio quality metric (from 1 to 5) that takes into account net-speech duration, signal-to-noise ratio, and the amount of clipping in the signal. We also compare two options for handling multi-speaker files: manual selections from a multi-speaker file, and an automatic segmental mode that splits recordings into segments of adjustable length and overlap. Segmental mode makes it possible to determine, fully automatically, whether a multi-speaker recording contains the speech of a specific speaker and at what point in the recording that speech occurs. Using the forensically relevant WYRED speaker recognition database, we demonstrate the effect of comparing files of different star quality ratings and examine the error rates obtained by using three different approaches to handling multi-speaker recordings.
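
The idea of splitting a recording into segments of adjustable length and overlap can be illustrated with a minimal sketch. This is not VOCALISE's implementation, which the abstract does not describe; the function name `segment_bounds` and its parameters are hypothetical, and the sketch assumes fixed-length windows with the final segment truncated at the end of the recording.

```python
def segment_bounds(total_s, seg_len_s, overlap_s):
    """Return (start, end) times in seconds for overlapping segments.

    Hypothetical helper for illustration only: fixed-length windows,
    each starting (seg_len_s - overlap_s) seconds after the previous one.
    The last segment is cut short if it would run past the recording.
    """
    if seg_len_s <= 0:
        raise ValueError("segment length must be positive")
    if not 0 <= overlap_s < seg_len_s:
        raise ValueError("overlap must be >= 0 and shorter than the segment")

    hop = seg_len_s - overlap_s  # time between consecutive segment starts
    bounds = []
    start = 0.0
    while start < total_s:
        end = min(start + seg_len_s, total_s)
        bounds.append((start, end))
        if end >= total_s:  # reached the end of the recording
            break
        start += hop
    return bounds


if __name__ == "__main__":
    # A 60 s recording, 20 s segments, 10 s overlap between neighbours.
    for start, end in segment_bounds(60, 20, 10):
        print(f"{start:5.1f} s to {end:5.1f} s")
```

Under these assumptions, each segment could be compared separately against the model of the speaker of interest, so that the segment boundaries indicate where in the recording a high-scoring match occurs.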

 

Permalink: https://aes2.org/publications/elibrary-page/?id=22629

