Speaker diarization is the process of automatically identifying and segmenting an audio recording into distinct speech segments, where each segment corresponds to a particular speaker. In simpler words, the goal is to answer the question: who spoke when? It involves analyzing the audio signal to detect changes in speaker identity, and then grouping together segments that belong to the same speaker.
Speaker diarization is a key component of conversation analysis tools and is often coupled with Automatic Speech Recognition (ASR) or Speech Emotion Recognition (SER) to extract meaningful information from conversational content. Hence, speaker diarization provides important information when performing speech analysis involving several speakers, as shown in Figure 1.
Figure 1. Illustration of the importance of speaker diarization within an automatic speech recognition pipeline
This article is the first in a series of 3 articles on speaker diarization. It introduces the main concepts behind speaker diarization and gives a broad overview of the diarization task in machine learning.
The speaker diarization task relates to answering the question: who spoke when? For such a problem, two main approaches can be considered:
- Supervised approach / Classification problem: a model is trained to recognize a finite number of speakers (i.e. classes). Thus, it is only expected to work with speakers that appeared in the training data.
Use case example — Efficient weekly team meeting transcription: the system has been trained to recognize the different members of a team. This approach works when the processed audio always involves the same speakers (classes). It is less flexible, but it can be very efficient, especially for a large number of speakers (>5).
- Unsupervised approach / Clustering problem: a model clusters audio segments according to the speaker based on extracted audio features. It is the most versatile approach since it can detect the number of speakers involved (number of clusters) and assign each voice segment to a specific cluster.
Use case example — Call center analytics: the system needs to work for any speaker involved in the conversation, with no prior knowledge of the speakers (operator/customer).
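To illustrate the unsupervised idea, here is a minimal sketch of threshold-based clustering of segment embeddings. The toy 2-D vectors, the greedy assignment rule, and the 0.8 threshold are all illustrative assumptions; real pipelines use learned speaker embeddings and more robust algorithms such as agglomerative or spectral clustering.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def cluster_segments(embeddings, threshold=0.8):
    """Greedy threshold clustering: assign each segment embedding to the
    most similar existing cluster, or open a new cluster (a new speaker)."""
    centroids, labels = [], []
    for emb in embeddings:
        sims = [cosine(emb, c) for c in centroids]
        if sims and max(sims) >= threshold:
            labels.append(sims.index(max(sims)))
        else:
            centroids.append(emb)
            labels.append(len(centroids) - 1)
    return labels

# Two distinct "voices" in a toy 2-D embedding space
embs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
print(cluster_segments(embs))  # → [0, 0, 1, 1]
```

Note that the number of clusters (speakers) is discovered on the fly, which is exactly what makes this approach suitable when the speakers are unknown in advance.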
Speaker diarization pipeline: who spoke when!?
Early approaches to speaker diarization were based on Digital Signal Processing techniques and, later on, on statistical modeling (Hidden Markov Models, Gaussian Mixture Models, etc.). Modern speaker diarization techniques rely on advanced deep learning and neural networks. In the context of unsupervised speaker diarization (the most common approach), the pipeline involves several subtasks: detecting voice activity, segmenting the audio into speech and non-speech segments, and then clustering the speech segments by speaker [2,3]. The typical pipeline is represented in Figure 2.
Figure 2. Scheme of a typical speaker diarization pipeline
First, Voice Activity Detection (VAD) performs a binary classification of small segments within the audio input: the model detects whether each small segment contains voice activity. These segments are defined by a window size and shifted along the input signal (see Figure 3). Then, to efficiently extract meaningful features from the different voices, one needs an Audio Embedding model trained to encode speech segments in a latent space. This latent representation of the speech segments is then passed to the Clustering (or Diarizer) algorithm.
Figure 3. Scheme of window shift and length parameters in VAD models
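The window/shift scheme above can be sketched as follows. The function simply enumerates the analysis windows a VAD model would classify; the default durations are illustrative, not values from any particular framework.

```python
def frame_windows(duration, window=0.5, shift=0.25):
    """Enumerate (start, end) analysis windows over an audio stream of
    `duration` seconds; each window is classified as speech or non-speech."""
    windows, start = [], 0.0
    while start + window <= duration + 1e-9:
        windows.append((round(start, 6), round(start + window, 6)))
        start += shift
    return windows

print(frame_windows(2.0, window=1.0, shift=0.5))
# → [(0.0, 1.0), (0.5, 1.5), (1.0, 2.0)]
```

A shift smaller than the window length makes consecutive windows overlap, which gives a finer-grained speech/non-speech decision at the cost of more model evaluations.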
How to evaluate a speaker diarization pipeline?
The standard metric for the speaker diarization problem is the Diarization Error Rate (DER):

DER = (False Alarm + Missed Detection + Confusion) / Total reference speech duration

This metric computes the diarization error in terms of a duration ratio, which means that a perfectly aligned hypothesis and reference diarization would result in DER=0. It sums up 3 error terms, as described in Table 1. It can be considered a micro metric, like standard accuracy for classification problems.
Here, False Alarm and Missed Detection are related to the voice activity detection step of the speaker diarization pipeline and Confusion is linked to the quality of the clustering. Figure 4 illustrates how the DER is computed from Reference (ground truth) and Hypothesis (system’s prediction) segmentations.
Figure 4. Schematic example of Diarization Error Rate computation
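The computation illustrated in Figure 4 can be sketched in a few lines. This toy version (all names and defaults are assumptions for illustration) discretises time on a fine grid, assumes no overlapping speech, and finds the best reference/hypothesis speaker mapping by brute force over permutations; real evaluations use a toolkit such as pyannote.metrics [4], which handles overlaps, collars and optimal matching properly.

```python
from itertools import permutations

def active(segments, t):
    """Label of the speaker active at time t, or None (no overlap assumed)."""
    for start, end, spk in segments:
        if start <= t < end:
            return spk
    return None

def der(reference, hypothesis, step=0.01):
    """DER = (false alarm + missed detection + confusion) / total ref speech,
    computed on a `step`-second grid. Assumes the hypothesis has no more
    speakers than the reference."""
    horizon = max(end for _, end, _ in reference + hypothesis)
    ticks = [i * step for i in range(round(horizon / step))]
    ref_spks = sorted({s for _, _, s in reference})
    hyp_spks = sorted({s for _, _, s in hypothesis})
    best = float("inf")
    for perm in permutations(ref_spks, len(hyp_spks)):
        mapping = dict(zip(hyp_spks, perm))  # hypothesis → reference labels
        fa = miss = conf = total = 0
        for t in ticks:
            r, h = active(reference, t), active(hypothesis, t)
            if r is not None:
                total += 1
                if h is None:
                    miss += 1
                elif mapping.get(h) != r:
                    conf += 1
            elif h is not None:
                fa += 1
        best = min(best, (fa + miss + conf) / total)
    return best

# Hypothesis switches speakers 2 s too late → 2 s of confusion out of 20 s
ref = [(0, 10, "A"), (10, 20, "B")]
hyp = [(0, 12, "s1"), (12, 20, "s2")]
print(der(ref, hyp))  # → 0.1
```

The brute-force mapping makes explicit that DER is invariant to how the clustering names its speakers: only the best possible alignment between hypothesis and reference labels counts.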
While DER is a convenient and simple standard time-based metric to evaluate the global quality of a model, one might need more advanced metrics to assess diarization quality. In particular, just like accuracy in a classification problem, DER gives more weight to long speech segments, while short speech segments can carry important linguistic content. To address this issue, a recent paper introduced the Conversational Diarization Error Rate (CDER), which measures the error rate in terms of segments (vs. duration for DER).
In our evaluation, we also computed the Balanced Error Rate (BER) introduced in recent research. This metric is more complex but delivers more meaningful information on the quality of a model, since it considers errors at both the time and segment levels, giving a macro measure of diarization quality by balancing error rates across speakers and improving the matching between reference and hypothesis. The method and algorithm for implementing BER are extensively described in the dedicated paper.
Building the ground truth
To evaluate predicted diarization with the metrics described above, one needs a reference diarization i.e. the ground truth. The standard format for speaker diarization is the RTTM format.
SPEAKER audio-file 1 18.62 14.87 <NA> <NA> speaker_1 <NA> <NA>
SPEAKER audio-file 1 33.49 8.5 <NA> <NA> speaker_2 <NA> <NA>
SPEAKER audio-file 1 41.99 1.0 <NA> <NA> speaker_1 <NA> <NA>
SPEAKER audio-file 1 42.99 1.12 <NA> <NA> speaker_3 <NA> <NA>
The RTTM format contains 3 important pieces of information about the segmentation: the segment start time, the segment length and the speaker label. This type of file can easily be parsed into a convenient Python object, for example the Annotation object from the pyannote.core package (see next article).
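As a sketch of how simple the format is to handle, the SPEAKER lines above can be parsed with nothing but the standard library (pyannote.core offers a richer Annotation object for real use):

```python
def parse_rttm(lines):
    """Parse RTTM SPEAKER lines into (start, duration, speaker) tuples.
    Fields: type file channel start duration <NA> <NA> speaker <NA> <NA>."""
    segments = []
    for line in lines:
        fields = line.split()
        if fields and fields[0] == "SPEAKER":
            segments.append((float(fields[3]), float(fields[4]), fields[7]))
    return segments

rttm = [
    "SPEAKER audio-file 1 18.62 14.87 <NA> <NA> speaker_1 <NA> <NA>",
    "SPEAKER audio-file 1 33.49 8.5 <NA> <NA> speaker_2 <NA> <NA>",
]
print(parse_rttm(rttm))
# → [(18.62, 14.87, 'speaker_1'), (33.49, 8.5, 'speaker_2')]
```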
Labelling and segmenting audio data is quite a difficult task, since it does not rely on visual access to the data but only on its audio content. For speaker diarization, one needs to listen to the audio and note the start and end timestamps of each speech segment along with a label identifying the speaker. This can be done by hand with a simple audio player, but Audacity is a great open-source audio processing software that comes with a marker track feature, allowing one to easily select and label audio segments and then export the markers as a CSV-like file (see Figure 5).
Figure 5 — How to label speakers in audio with Audacity
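Audacity's label export is a tab-separated file with one `start end label` line per marker, so converting it to the RTTM format described above takes only a few lines. The file name and formatting choices below are illustrative assumptions:

```python
def audacity_to_rttm(label_lines, file_id="audio-file"):
    """Convert Audacity label-track export lines (start<TAB>end<TAB>label)
    into RTTM SPEAKER records (which use start time + duration)."""
    rttm = []
    for line in label_lines:
        start, end, speaker = line.strip().split("\t")
        duration = float(end) - float(start)
        rttm.append(
            f"SPEAKER {file_id} 1 {float(start):.2f} {duration:.2f} "
            f"<NA> <NA> {speaker} <NA> <NA>"
        )
    return rttm

labels = ["18.62\t33.49\tspeaker_1", "33.49\t41.99\tspeaker_2"]
print("\n".join(audacity_to_rttm(labels)))
```

This turns a quick manual labelling session into a ground-truth RTTM file ready for evaluation.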
In this first article on the topic, we introduced the concept of speaker diarization and gave an overview of the modern speaker diarization pipelines associated with the unsupervised approach. These pipelines are built from several subtasks involving deep learning models for Voice Activity Detection, Audio Embedding and Clustering. In addition, the main metrics to evaluate speaker diarization have been introduced, along with a simple method to generate a reference file (i.e. ground truth) for model evaluation.
An efficient diarization pipeline is a key component for automatic speech recognition for conversational content. Indeed, it provides a meaningful segmentation that can be further processed and serve as a basis for ASR and NLP tasks. An example of a complete conversational analysis tool is shown in Figure 6.
Figure 6. Example of architecture for complete speech processing pipeline
Pre-trained models and frameworks can easily be leveraged to perform speaker diarization with low error rates, and modern frameworks are flexible and can be tuned for many different use cases without having to fine-tune models. In the following articles, we will first focus on an in-depth comparison of two main state-of-the-art frameworks for speaker diarization: pyannote and NeMo. Then, we propose a deep dive into the practical use of the NeMo framework for specific use cases.
 Tao Liu, & Kai Yu. (2022). BER: Balanced Error Rate For Speaker Diarization.
 Bredin, H., Yin, R., Coria, J., Gelly, G., Korshunov, P., Lavechin, M., Fustes, D., Titeux, H., Bouaziz, W., & Gill, M.P. (2020). pyannote.audio: neural building blocks for speaker diarization. In ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing.
 Oleksii Kuchaiev, Jason Li, Huyen Nguyen, Oleksii Hrinchuk, Ryan Leary, Boris Ginsburg, Samuel Kriman, Stanislav Beliaev, Vitaly Lavrukhin, Jack Cook, Patrice Castonguay, Mariya Popova, Jocelyn Huang, & Jonathan M. Cohen. (2019). NeMo: a toolkit for building AI applications using Neural Modules.
 Herve Bredin (2017). pyannote.metrics: a toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems. In Interspeech 2017, 18th Annual Conference of the International Speech Communication Association.
 Gaofeng Cheng, Yifan Chen, Runyan Yang, Qingxuan Li, Zehui Yang, Lingxuan Ye, Pengyuan Zhang, Qingqing Zhang, Lei Xie, Yanmin Qian, Kong Aik Lee, & Yonghong Yan. (2022). The Conversational Short-phrase Speaker Diarization (CSSD) Task: Dataset, Evaluation Metric and Baselines.
Thanks to our colleagues Alexandre DO, Jean-Baptiste BARDIN, Lu WANG and Edouard LAVAUD for reviewing the article.
Jules SINTES is a data scientist at La Javaness.