Application of Automatic Speaker Verification in Forensic Casework



Reference

Title: Application of Automatic Speaker Verification in Forensic Casework

Author(s): Johan Koolwaaij

Reference: Proceedings International Association for Forensic Phonetics (IAFP'98), Voorburg

Keywords: Speaker Recognition

Abstract

Speaker Verification is the process of verifying an identity claim using the voice of the person making that claim. The input is an utterance of unknown origin plus an identity claim, and there are two possible outcomes: accepted (the unknown utterance originated from the claimed identity) or rejected (the unknown utterance originated from someone other than the claimed identity). When the verification is done by a computer it is called Automatic Speaker Verification (ASV).

The ASV algorithm's output is a likelihood ratio, which can be translated into a decision using an application-dependent decision criterion. In some commercial applications of ASV the percentage of erroneous decisions is already below 1%. But in these applications the degree of control over the speech content of the unknown utterances is high: for example, only digits are allowed as valid speech tokens, and a speaker is modeled by a set of ten digit models. This is called text-dependent verification.
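The decision step described above can be sketched in a few lines. This is a minimal illustration, not any particular ASV system: the function name `verify` and the threshold value are assumptions for the example, and the threshold stands in for the application-dependent decision criterion.

```python
def verify(log_likelihood_ratio: float, threshold: float) -> str:
    """Accept the identity claim iff the log-likelihood ratio of the
    unknown utterance exceeds the application-dependent threshold."""
    return "accepted" if log_likelihood_ratio > threshold else "rejected"

# A high-security application would set a stricter (higher) threshold,
# trading more false rejects for fewer false accepts.
print(verify(2.3, threshold=0.0))   # accepted
print(verify(-0.7, threshold=0.0))  # rejected
```

The choice of threshold is exactly what the abstract calls the application-dependent decision criterion: moving it trades false accepts against false rejects.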

In most forensic applications of ASV there is no way to control or predict the speech content of, for example, a conversation. In this context we have to use text-independent verification, which is a more difficult problem: the speaker model contains speaker information as well as significant variation due to the diversity in speech content. As a result the error rates are an order of magnitude higher than for text-dependent verification.

In forensics, however, it is not acceptable to deliver only the final decision of the ASV system, based on a non-transparent decision criterion, with the statement that the system decides erroneously in n percent of the cases. More precise information is needed about the individual case: the position of the unknown utterance has to be shown on the likelihood ratio scale, together with the error rates (both the false accept rate and the false reject rate) of the ASV system as a function of the likelihood ratio. A complete evaluation of the performance of the ASV system therefore has to be available.
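Reporting the error rates at the case's own position on the likelihood ratio scale can be sketched as follows. The scores below are hypothetical evaluation data invented for the example (the abstract does not give any numbers), and `error_rates` is an illustrative helper, not part of any described system.

```python
def error_rates(target_scores, impostor_scores, threshold):
    """False reject rate: fraction of genuine (target) trials scoring
    at or below the threshold.  False accept rate: fraction of
    impostor trials scoring above it."""
    frr = sum(s <= threshold for s in target_scores) / len(target_scores)
    far = sum(s > threshold for s in impostor_scores) / len(impostor_scores)
    return far, frr

# Hypothetical evaluation scores (log-likelihood ratios) from
# same-speaker (target) and different-speaker (impostor) trials:
targets = [1.8, 2.5, 0.4, 3.1, -0.2, 2.0]
impostors = [-1.5, -0.3, 0.6, -2.2, -0.9, 0.1]

# Place the current case's unknown utterance on the likelihood ratio
# scale and report the evaluation error rates at that point:
case_llr = 0.5
far, frr = error_rates(targets, impostors, case_llr)
```

Sweeping the threshold over the whole score range yields the full false-accept/false-reject trade-off curve; the point of the sketch is that both rates are reported at the case's own likelihood ratio rather than a single hidden operating point.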

And, even more important, it has to be made plausible that the current case shares its characteristics with the cases on which the error figures are based. For example, when the error figures are obtained using digitally recorded telephone speech, but the current case is a comparison between a voice recorded on tape and probably the same voice from a telephone tap, little can be said about the real false accept rate and false reject rate at the likelihood ratio point of the unknown utterance. There are many such important characteristics, including the degree of mismatch between train and test data, recording environment, microphone, recording platform, language, and duration of train and test data.

On the basis of a case study we address the importance of all these characteristics. The main question is not how to improve a certain ASV system, but, given an ASV system, how to come to a balanced decision in individual cases.