Local normalization and delayed decision making in speaker detection and tracking
Title: Local normalization and delayed decision making in speaker detection and tracking
Author(s): Johan Koolwaaij & Lou Boves
Reference: Special Issue of Digital Signal Processing: A Review Journal on the NIST Speaker Recognition Workshop, Volume 10, Number 1-3, pp. 113-132
Keywords: Speaker Recognition
This paper describes A2RT's speaker detection and tracking system and its performance on the 1999 NIST speaker recognition evaluation data. The system is not a chain of concatenated modules (for instance silence-speech detection, handset and gender detection, and finally speaker detection or tracking) in which each module builds on the hard decisions of the previous ones. Instead, it applies the principle of delayed decision making and postpones all hard decisions until the final stage of the detection process. The paper focuses on two important locality issues in detecting or tracking speakers in a telephone conversation, where the speaker change frequency is usually high. First, channel estimation needs segments that are sufficiently long but also homogeneous; several kinds of local channel normalization are compared. Second, local estimation of speaker likelihoods critically depends on the segmentation of the conversation.
Our experiments show that a global level of segmentation improves
speaker tracking performance, whereas a more detailed segmentation is needed
for speaker detection, because likelihood computation over clusters of segments depends
on the purity of the segments. Furthermore, choosing the appropriate type of channel
normalization can give a small but consistent improvement in speaker tracking performance.
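As an illustration of what "local" channel normalization can mean, the sketch below subtracts from each feature frame the mean of its neighbouring frames, and contrasts this with subtracting a single mean over the whole recording. It is not taken from the paper: the function names, the sliding-window scheme, and the `half_window` parameter are assumptions made for this example, and the abstract does not say which normalization variants the authors compared.

```python
from typing import List, Sequence


def local_mean_normalize(
    frames: Sequence[Sequence[float]], half_window: int
) -> List[List[float]]:
    """Subtract from each frame the mean of the frames within +/- half_window.

    Hypothetical helper: a sliding-window mean subtraction, used here only
    as one simple example of a local channel normalization.
    """
    if half_window < 0:
        raise ValueError("half_window must be non-negative")
    n = len(frames)
    normalized: List[List[float]] = []
    for t in range(n):
        # Clip the window at the edges of the recording.
        start = max(0, t - half_window)
        end = min(n, t + half_window + 1)
        window = frames[start:end]
        dim = len(frames[t])
        # Per-dimension mean over the local window (the channel estimate).
        mean = [sum(f[d] for f in window) / len(window) for d in range(dim)]
        normalized.append([frames[t][d] - mean[d] for d in range(dim)])
    return normalized


def global_mean_normalize(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Subtract one mean computed over the whole recording."""
    # A window that covers every frame reduces to global mean subtraction.
    return local_mean_normalize(frames, len(frames))
```

The window length reflects the trade-off named in the abstract: a longer window gives a more stable channel estimate, but it is more likely to span a speaker or channel change and so to be less homogeneous.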