Deep-dive into NeMo: How to efficiently tune a speaker diarization pipeline?

La Javaness R&D
5 min read · Jul 17, 2023

Introduction

After introducing the speaker diarization problem and presenting the two main state-of-the-art frameworks, pyannote.audio and NeMo, this third article on speaker diarization presents in more detail how a speaker diarization pipeline can be tuned to get the best results for a specific purpose.

We focus on understanding how the different hyperparameters of the NeMo speaker diarization pipeline can be tuned to adapt the behavior of the models to the results expected for specific use cases.

The tuning of the parameters of NeMo’s diarization pipeline was based on the speaker diarization notebook provided in NeMo’s documentation. Here the objectives are to:

  1. Understand the behaviour of each model through its hyperparameters.
  2. Minimize error rate for our defined test cases.
  3. Optimize inference time and evaluate GPU requirements.

Defining expected behavior or reference file

First, one needs to define a test case along with the expected behavior of the model. Several things need to be considered; a few examples of relevant questions are:

  • What should the granularity of the segmentation be, e.g. sentence level or word level?
  • Is overlapping speech likely, given the use case?
  • Which kind of error is most important to minimize: false alarm, speaker confusion, or missed detection?

Many other aspects can be taken into consideration and will influence how the hyperparameters of models should be tuned.
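To make the three error types concrete, here is a minimal, self-contained sketch of how they can be measured on a toy example. The interval logic and variable names are illustrative only; in practice one would use an evaluation toolkit rather than this hand-rolled computation.

```python
# Illustrative sketch: quantify missed detection, false alarm and speaker
# confusion on toy reference/hypothesis annotations (not NeMo's scoring code).

def overlap(a, b):
    """Total overlap duration between two lists of (start, end) intervals."""
    total = 0.0
    for s1, e1 in a:
        for s2, e2 in b:
            total += max(0.0, min(e1, e2) - max(s1, s2))
    return total

# Reference: speaker A talks 0-4 s, speaker B talks 4-8 s.
reference = {"A": [(0.0, 4.0)], "B": [(4.0, 8.0)]}
# Hypothesis: spk1 detected 0-5 s, spk2 detected 6-8 s.
hypothesis = {"spk1": [(0.0, 5.0)], "spk2": [(6.0, 8.0)]}

ref_speech = [iv for ivs in reference.values() for iv in ivs]
hyp_speech = [iv for ivs in hypothesis.values() for iv in ivs]

speech_total = sum(e - s for s, e in ref_speech)
overlap_total = overlap(ref_speech, hyp_speech)

# Reference speech with no hypothesis speech at all:
missed = speech_total - overlap_total
# Hypothesis speech with no reference speech at all:
false_alarm = sum(e - s for s, e in hyp_speech) - overlap_total
# Overlapping speech attributed to the wrong speaker, under the
# (here obvious) 1-to-1 mapping A<->spk1, B<->spk2:
mapping = {"A": "spk1", "B": "spk2"}
correct = sum(overlap(reference[r], hypothesis[h]) for r, h in mapping.items())
confusion = overlap_total - correct

print(missed, false_alarm, confusion)
```

Here the hypothesis misses one second of speech (5-6 s), raises no false alarm, and confuses one second (spk1 still speaking during 4-5 s, which belongs to B).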

Figure 1 — Example of different granularity in the segmentation

Tuning NeMo pipeline

The pipeline is mainly defined by the 3 neural network models described previously. We chose to keep the default pre-trained models from Nvidia’s catalog.

Voice activity detection

Voice activity detection is the first stage of the speaker diarization pipeline. The audio input spectrogram is processed sequentially in small frames (usually between 0.1 and 1 s), each of which is classified as containing voice activity or not. The two main hyperparameters are the window length and the window shift, which define the size and stride of the audio input passed to the model. They therefore have a large influence on the granularity of the output: choosing a larger window size and shift is likely to lead to longer detected speech segments.
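The effect of the window length and shift on framing can be sketched as follows; the function and the numbers below are illustrative, not NeMo’s internal implementation or defaults.

```python
# Illustrative sketch of how a window length and shift slice an audio stream
# into VAD analysis windows (values are made up, not NeMo defaults).

def vad_frames(duration_s, window_length_s, shift_s):
    """Return the (start, end) times of each analysis window."""
    frames, start = [], 0.0
    while start + window_length_s <= duration_s:
        frames.append((round(start, 3), round(start + window_length_s, 3)))
        start += shift_s
    return frames

# A 0.63 s window shifted by 0.08 s over 3 s of audio:
frames = vad_frames(3.0, 0.63, 0.08)
print(len(frames), frames[0], frames[1])
```

A smaller shift produces more, heavily overlapping windows (hence finer-grained decisions) at the cost of more classifier invocations per second of audio.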

Figure 2 — Voice activity detection scheme

Other parameters include the onset/offset thresholds for speech detection as well as padding parameters, which influence the sensitivity of voice detection, making the model more or less likely to report false speech activity.
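The role of the onset/offset thresholds and padding can be sketched with a small post-processing routine over per-frame speech probabilities. This is a simplified stand-in for NeMo’s VAD post-processing; the threshold values and probabilities are invented for illustration.

```python
# Illustrative sketch of onset/offset thresholding with padding over per-frame
# speech probabilities (simplified; not NeMo's actual post-processing code).

def probs_to_segments(probs, frame_s, onset=0.8, offset=0.3, pad_s=0.0):
    """Turn per-frame speech probabilities into (start, end) speech segments.

    A segment opens when the probability rises above `onset` and closes only
    when it falls below `offset` (hysteresis); `pad_s` then extends each side.
    """
    segments, start, active = [], None, False
    for i, p in enumerate(probs):
        t = i * frame_s
        if not active and p >= onset:
            start, active = t, True
        elif active and p < offset:
            segments.append((max(0.0, start - pad_s), t + pad_s))
            active = False
    if active:
        segments.append((max(0.0, start - pad_s), len(probs) * frame_s + pad_s))
    return segments

probs = [0.1, 0.9, 0.7, 0.5, 0.2, 0.1, 0.95, 0.9]
segs = probs_to_segments(probs, frame_s=0.1)
print(segs)
```

Raising the onset threshold makes the detector stricter about opening a segment (fewer false alarms), while lowering the offset threshold makes it hold on to an open segment longer (fewer missed detections mid-sentence).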

Figure 3 — Comparison of a 60s segmentation for different window length (L) and shift (S) values (in s)

The figure above illustrates how changing the window length and shift affects the granularity of the segmentation. In particular, when the parameters become too large (1.5, 0.1), no non-speech segments remain, while small values lead to a very fine segmentation made of very short segments. In many cases, one wants either a very precise, word-level segmentation or, more commonly, an intermediate, sentence-level behavior.

Speaker embedding and clustering

NeMo comes with a multiscale clustering approach that performs diarization on temporal multiscale embeddings: the segments are embedded at several different time scales. This handles the trade-off between good voice embeddings, which need longer segments, and fine granularity of the segmentation.

Figure 4 — Scheme explaining the multi-scale approach for speaker embedding from NeMo’s doc

The default parameters involve 5 embedding scales ranging from 0.5-second to 1.5-second segments. For a simple test case where one wants larger segments, a single embedding scale might be sufficient, especially a larger one (1.0 or 1.5 seconds). However, when dealing with more complex test cases (e.g. up to 8 speakers), leveraging the multi-scale embedding helps achieve high-quality results.
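A multi-scale setup of this kind can be written down as a small configuration sketch in the shape NeMo’s diarizer config uses (per-scale window lengths, shifts, and weights). The exact values below are illustrative defaults-style numbers, not guaranteed to match the library’s shipped configuration.

```python
# Illustrative multi-scale embedding configuration in the shape of NeMo's
# diarizer config (window/shift lists plus per-scale weights); the values
# are for illustration, not an authoritative copy of the library defaults.

multiscale_cfg = {
    "window_length_in_sec": [1.5, 1.25, 1.0, 0.75, 0.5],   # one entry per scale
    "shift_length_in_sec": [0.75, 0.625, 0.5, 0.375, 0.25],  # here, window / 2
    "multiscale_weights": [1, 1, 1, 1, 1],  # equal contribution of each scale
}

# For a simple use case favouring longer segments, a single coarse scale may do:
single_scale_cfg = {
    "window_length_in_sec": [1.5],
    "shift_length_in_sec": [0.75],
    "multiscale_weights": [1],
}

n_scales = len(multiscale_cfg["window_length_in_sec"])
print(n_scales)
```

Dropping to one scale trades embedding robustness for a faster pipeline, which matches the discussion above: simple two-speaker cases tolerate it, crowded recordings generally do not.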

Quick inference time and GPU requirements study

To evaluate hardware resource requirements and optimize GPU usage while reducing inference time, we monitored inference time and GPU memory usage for different batch sizes.

Figure 5 — Monitoring of GPU memory reserved for different batch size (left) and Inference time computation vs batch size (right)

Since the pipeline involves fairly large models, especially TitaNet-Large, it needs a minimum amount of GPU memory to perform inference. Batched inference significantly reduces computing time; however, choosing a batch size larger than 32 no longer reduces inference time while GPU memory usage keeps growing steeply. Thus, a batch size between 4 and 16 drastically reduces computing time while keeping GPU memory usage at a minimum (<4 GB).
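A batch-size sweep like the one behind Figure 5 can be organised with a small timing harness. The `run_batch` function below is a dummy stand-in for the real diarization forward pass; only the sweep structure is the point.

```python
# Illustrative batch-size sweep harness: time a (dummy) batched inference
# function over several batch sizes. `run_batch` is a placeholder for the
# actual model call, not NeMo code.
import time

def run_batch(batch):
    # Placeholder workload standing in for one batched forward pass.
    return [sum(x) for x in batch]

def sweep(samples, batch_sizes):
    """Return wall-clock time to process all samples at each batch size."""
    timings = {}
    for bs in batch_sizes:
        t0 = time.perf_counter()
        for i in range(0, len(samples), bs):
            run_batch(samples[i:i + bs])
        timings[bs] = time.perf_counter() - t0
    return timings

samples = [[0.0] * 256 for _ in range(256)]
timings = sweep(samples, [1, 4, 16, 32, 64])
print(sorted(timings))
```

With a real model one would additionally record peak GPU memory per batch size (e.g. via the framework’s memory statistics) to reproduce both panels of the figure.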

Conclusion

Optimizing a speaker diarization pipeline for a specific use case first requires a clear definition of the expected behavior of the models and of the expected results. Then, through a deep understanding of the hyperparameters, one can tweak them to match that behavior. Since the effect of each hyperparameter (especially for the VAD) is quite interpretable, tweaking parameters by hand gives better control of the pipeline and is, in common use cases, more efficient than using parameter-search algorithms, for example.

NeMo’s pipeline includes 3 trainable models, all of which have hyperparameters. The most important model for the segmentation is the VAD, which needs to be tuned precisely to get the best results. Then, while it needs more computing time and resources, NeMo’s multi-scale embedding approach yields high-quality results and offers flexibility when quicker segmentation is needed, by performing batched inference or reducing the number of embedding scales.

Moreover, here we only focused on the default models and used them out of the box. In a production context, with a well-defined use case, one might want to fine-tune the models on a specific dataset matching that use case. This could significantly reduce the error rate while making the tuning of hyperparameters easier. The following table sums up the main parameters to tune and the values that we used for our optimized pipeline. A few other parameters can be tweaked to get the best results depending on the use case and the quality of the audio.

And this concludes this series of 3 articles about speaker diarization! ✅

Acknowledgement

Thanks to our colleagues Alexandre DO, Jean-Baptiste BARDIN, Lu WANG and Edouard LAVAUD for the article review.

About

Jules SINTES is a data scientist at La Javaness.


La Javaness R&D

We help organizations to succeed in the new paradigm of “AI@scale”, by using machine intelligence responsibly and efficiently: www.lajavaness.com