Comparing state-of-the-art speaker diarization frameworks : Pyannote vs Nemo

La Javaness R&D
Jul 17, 2023


Introduction

In the previous article of this series, we introduced the concept of speaker diarization, i.e. finding “who spoke when?”, along with the typical architecture of modern speaker diarization pipelines. We also presented a simple labelling method for creating diarization reference files, as well as the standard metrics used to measure model performance.

Speaker diarization is now a well-known and quite general problem, so one can easily leverage pre-existing frameworks and models and adapt them to specific use cases. In particular, one can use a pre-trained pipeline out of the box or, if needed, customize a pipeline by optimizing its hyperparameters and/or further fine-tuning its models.

This article focuses on two state-of-the-art open-source frameworks for speaker diarization: pyannote.audio (H. Bredin) [1] and NeMo (Nvidia) [2]. We conduct an in-depth comparison of the frameworks and test them on specific use cases. The objective is to choose the most suitable framework for a project building a multitask conversation-analysis tool.

Overview of frameworks

Two main frameworks for speaker diarization are studied here:

  • 🎹 pyannote — an open-source library dedicated to speaker diarization, based on the PyTorch framework and developed by Hervé Bredin [1,4].
  • 🐠 NeMo — an open-source deep learning framework developed by Nvidia, also based on PyTorch, targeting NLP tasks with a strong emphasis on speech processing. It provides tools (pipelines, trainers, pre-trained models, etc.) to build AI applications [2].

The following table summarizes the main models involved in each speaker diarization pipeline, along with links to the research papers if you want to understand how these models work (spoiler: it is super interesting! 🤠).

Overall, the major differences lie in the individual submodules, but the two frameworks share the same global pipeline: audio files as input, RTTM files as output (see the previous article for details on the RTTM format). Note that NeMo is a much larger library for developing ASR and NLP applications, and hence comes with a well-defined framework for training, evaluation and inference, while pyannote is primarily developed for speaker diarization and seems more research-oriented.

The main differences between the models are:

  • The VAD model from pyannote, PyanNet, processes the raw audio waveform, while NeMo’s MarbleNet takes a Mel-spectrogram as input (the standard input for ASR and audio-processing tasks in general).
  • The audio-embedding models are similar, both being convolutional neural networks. However, NeMo’s clustering is multi-scale: it leverages several latent representations of the audio input at different time scales and performs the clustering on these multiple voice embeddings (see the configuration sketch below). pyannote takes a simpler approach with a standard Hidden Markov Model clustering.
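
To make the multi-scale idea concrete, here is a minimal sketch of how the corresponding part of a NeMo diarization configuration can be adjusted with OmegaConf. The file path is illustrative, and the scale values shown follow the defaults suggested in NeMo’s inference configurations; exact keys may vary with the NeMo version:

from omegaconf import OmegaConf

# Load a NeMo diarization inference configuration (path is illustrative)
config = OmegaConf.load('diar_infer.yaml')

# Each scale uses its own window size (in seconds) and produces its own embeddings...
config.diarizer.speaker_embeddings.parameters.window_length_in_sec = [1.5, 1.25, 1.0, 0.75, 0.5]
config.diarizer.speaker_embeddings.parameters.shift_length_in_sec = [0.75, 0.625, 0.5, 0.375, 0.25]
# ...and the per-scale results are fused with these weights before clustering
config.diarizer.speaker_embeddings.parameters.multiscale_weights = [1, 1, 1, 1, 1]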

The following table presents the pros and cons of each framework from our perspective and test cases:

How to use the frameworks and measure performance?

To test the frameworks, we use two different test cases:

  1. A simple test case — A conversation with 2 speakers recorded in high quality.
  2. A complex test case — A recording with several conversations involving 7 speakers recorded with varying quality.

Create a reference file and parsing functions

To create a ground-truth labelling file, we load the audio in Audacity, remove useless parts at the beginning and end of the recording, and create a segmentation labelling following the method described in the first article. The label track can then be exported from Audacity as a .txt file. As mentioned in the first article, the standard format when working with a speaker diarization pipeline is the RTTM format.
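
For reference, here is what the two formats look like for a single speech turn (values are illustrative). An Audacity label line is tab-separated with start time, end time and label, while an RTTM line is a space-separated, 10-field record in which the 4th field is the onset, the 5th the duration and the 8th the speaker name:

# Audacity label track export (tab-separated: start, end, label)
12.350000	17.820000	speaker_A

# Equivalent RTTM line (space-separated, 10 fields)
SPEAKER audio-test 1 12.350 5.470 <NA> <NA> speaker_A <NA> <NA>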

To make things easier, we use the pyannote.core library, which is part of the pyannote ecosystem and implements convenient Python objects for segments, timelines and annotations specifically designed for speaker diarization. We propose a set of extensions to the Annotation object to match our specific needs:

from typing import Iterator, Optional, Text, TextIO

from pyannote.core import Annotation, Segment


class Annotation(Annotation):
    @classmethod
    def from_rttm(
        cls, rttm_file: TextIO, uri: Optional[str] = None, modality: Optional[str] = None,
    ) -> "Annotation":
        """Create annotation from an RTTM file

        Parameters
        ----------
        rttm_file : file object
            open RTTM file
        uri : string, optional
            name of annotated resource (e.g. audio or video file)
        modality : string, optional
            name of annotated modality

        Returns
        -------
        annotation : Annotation
            New annotation
        """
        segment_list = []
        for line in rttm_file:
            line = line.rstrip().split(" ")
            segment_list.append(
                (
                    # RTTM fields: onset (4th), duration (5th), speaker name (8th)
                    Segment(start=float(line[3]), end=float(line[3]) + float(line[4])),
                    int(line[2]),
                    str(line[7]),
                )
            )
        return Annotation.from_records(segment_list, uri, modality)

    def _iter_audacity(self) -> Iterator[Text]:
        """Generate the lines of an Audacity marker file for this annotation

        Returns
        -------
        iterator : Iterator[str]
            An iterator over Audacity text lines
        """
        for segment, _, label in self.itertracks(yield_label=True):
            yield f"{segment.start:.3f}\t{segment.start + segment.duration:.3f}\t{label}\n"

    def to_audacity(self) -> Text:
        """Serialize annotation as a string using the Audacity format

        Returns
        -------
        serialized : str
            Audacity marker string
        """
        return "".join([line for line in self._iter_audacity()])

    def write_audacity(self, file: TextIO):
        """Dump annotation to file using the Audacity format

        Parameters
        ----------
        file : file object

        Usage
        -----
        >>> with open('file.txt', 'w') as file:
        ...     annotation.write_audacity(file)
        """
        for line in self._iter_audacity():
            file.write(line)

    @classmethod
    def from_audacity(
        cls, audacity_file: TextIO, uri: Optional[str] = None, modality: Optional[str] = None,
    ) -> "Annotation":
        """Create annotation from an Audacity marker file

        Parameters
        ----------
        audacity_file : file object
            open Audacity marker (.txt) file
        uri : string, optional
            name of annotated resource (e.g. audio or video file)
        modality : string, optional
            name of annotated modality

        Returns
        -------
        annotation : Annotation
            New annotation
        """
        segment_list = []
        for line in audacity_file:
            start, end, label = line.rstrip().split("\t")
            segment_list.append(
                (Segment(start=float(start), end=float(end)), 1, str(label))
            )
        return Annotation.from_records(segment_list, uri, modality)
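
As an example of how we use this extended object, the snippet below (file paths are illustrative) converts a manually labelled Audacity export into an RTTM reference file, relying on the write_rttm method inherited from pyannote.core:

# Convert the Audacity label export into an RTTM reference file (paths are illustrative)
with open('../data/reference_labels.txt') as f:
    reference = Annotation.from_audacity(f, uri='audio-test')

with open('../data/reference.rttm', 'w') as f:
    reference.write_rttm(f)  # write_rttm is inherited from pyannote.core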

Perform basic inference with frameworks

Both pipelines can then be used very easily. pyannote relies on the Hugging Face Hub to distribute its pre-trained pipelines, while NeMo has its own framework with a Hydra-based configuration. The simplest way to generate an RTTM file with pyannote out of the box is:

from pyannote.audio import Pipeline

TOKEN = 'your-hf-token'
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization@2.1",
                                    use_auth_token=TOKEN)

audio_input = '/path/to/audio.wav'

# apply the pipeline to an audio file
diarization = pipeline(audio_input, num_speakers=2)

# dump the diarization output to disk using RTTM format
with open("audio-test-pyannote.rttm", "w") as rttm:
    diarization.write_rttm(rttm)

For further information see https://github.com/pyannote/pyannote-audio

For NeMo, a starting parameters file can be downloaded from the GitHub repo: https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/inference

The simplest way to use it is:

import json
import os

from nemo.collections.asr.models import ClusteringDiarizer
from omegaconf import OmegaConf

input_file = '/path/to/audio.wav'  # Note: the file needs to be mono .wav

# NeMo reads its inputs from a manifest file (one JSON entry per audio file)
meta = {
    'audio_filepath': input_file,
    'offset': 0,
    'duration': None,
    'label': 'infer',
    'text': '-',
    'num_speakers': 7,
    'rttm_filepath': None,  # You can add a reference file here
    'uem_filepath': None,
}
manifest_path = '../data/input_manifest.json'
with open(manifest_path, 'w') as fp:
    json.dump(meta, fp)
    fp.write('\n')

output_dir = os.path.join('../data/', 'oracle_vad')
os.makedirs(output_dir, exist_ok=True)

MODEL_CONFIG = '../data/param.yaml'
config = OmegaConf.load(MODEL_CONFIG)
# point the configuration to our manifest and output directory
config.diarizer.manifest_filepath = manifest_path
config.diarizer.out_dir = output_dir
print(OmegaConf.to_yaml(config))

sd_model = ClusteringDiarizer(cfg=config)
sd_model.diarize()

This code is directly inspired by NeMo’s (more complete) tutorial:

https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/speaker_tasks/Speaker_Diarization_Inference.ipynb#scrollTo=CwtVUgVNBR_P

Evaluate the performance

Once inference has been performed with both pipelines, metrics can be computed. The easiest way to compute the standard Diarization Error Rate (DER) is to use the implementation from pyannote.metrics [3], the pyannote package dedicated to metrics and evaluation, which works directly on the Annotation and Segment objects mentioned above.

Using our enhanced Annotation object and pyannote.metrics, we can easily compute the DER by loading the reference and hypothesis files:

from pyannote.metrics.diarization import DiarizationErrorRate

with open('path/to/ref.txt') as f:
    ref = Annotation.from_audacity(f)

with open('path/to/hyp.rttm') as f:
    hyp = Annotation.from_rttm(f)

der = DiarizationErrorRate()
der_result = der(ref, hyp)
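
For readers who want the breakdown into missed detection, false alarm and speaker confusion without a dedicated script, pyannote.metrics can also return its internal components directly; here is a minimal sketch (the exact component names may vary with the library version):

from pyannote.metrics.diarization import DiarizationErrorRate

metric = DiarizationErrorRate()

# detailed=True returns the raw components (in seconds) used to compute the DER,
# e.g. missed detection, false alarm and speaker confusion
components = metric(ref, hyp, detailed=True)
print(components)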

Results

To quickly get first results, we run both pipelines with their default recommended parameters on the simple test case and observe their macro behaviour.

There is a significant difference in segmentation granularity between the two models, which also differs from the reference labelling. Indeed, for a ~30-minute audio input, pyannote generates 126 speech segments while NeMo generates 1209; the manually labelled reference file contains 251 segments. To make NeMo better match the reference segmentation, we tune the VAD model’s hyperparameters: in particular, the window size and shift length are significantly increased to produce a coarser segmentation that better matches the expected behaviour for our test case (see the sketch below).
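
As an illustration, this is roughly how such an adjustment can be made on the NeMo configuration before instantiating the diarizer; the parameter names follow the inference configuration files linked above, but the values shown here are purely indicative and should be tuned for your own data:

from omegaconf import OmegaConf

config = OmegaConf.load('../data/param.yaml')

# larger analysis window and shift -> fewer, longer speech segments
config.diarizer.vad.parameters.window_length_in_sec = 0.8   # indicative value
config.diarizer.vad.parameters.shift_length_in_sec = 0.16   # indicative value

# merging close segments and dropping very short ones also coarsens the output
config.diarizer.vad.parameters.min_duration_on = 0.2
config.diarizer.vad.parameters.min_duration_off = 0.3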

We can test each model’s segmentation against the reference and compute the main speaker diarization metrics. Here we used the DiarizationMetricInOne script to compute them. The results are reported in the following table:

Note: MD = Missed Detection, FA = False Alarm, SC = Speaker Confusion; see previous article for details on metrics

While pyannote and NeMo with optimized VAD parameters both achieve good results with only a 10% error rate, NeMo’s default pipeline shows significantly higher error rates. However, these numbers need to be put into perspective, since 36% of the DER comes from missed speech. NeMo (with default parameters) detects voices with much finer granularity, keeping only audio segments with linguistic content and discarding small non-speech segments (mouth noise, breathing, etc.). We checked this by taking the NeMo diarization output, concatenating the predicted non-speech segments into a new audio file and labelling this file to detect missed speech (voice activity with actual linguistic content): only 8% of the predicted non-speech content is actually linguistic.
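
For the curious, this check can be sketched with pyannote.core and pydub, using the extended Annotation object defined earlier; this is a simplified illustration under our own assumptions (file paths and the use of pydub), not the exact script we used:

from pydub import AudioSegment
from pyannote.core import Segment

# load the NeMo hypothesis and the corresponding audio
with open('path/to/hyp.rttm') as f:
    hyp = Annotation.from_rttm(f)
audio = AudioSegment.from_wav('/path/to/audio.wav')

# non-speech = gaps of the predicted speech timeline over the whole recording
whole_file = Segment(0, len(audio) / 1000)  # pydub lengths are in milliseconds
non_speech = hyp.get_timeline().support().gaps(support=whole_file)

# concatenate all non-speech chunks into a single audio file for manual review
review = AudioSegment.empty()
for gap in non_speech:
    review += audio[int(gap.start * 1000):int(gap.end * 1000)]
review.export('non_speech_only.wav', format='wav')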

The pipelines are then tested on the complex test case. The optimized VAD parameters are kept in order to match the expected segmentation granularity.

Here, NeMo performs much better than pyannote, even when the number of speakers (clusters) is specified. In particular, the NeMo pipeline seems to handle lower audio quality better than pyannote, and NeMo is less likely to detect overlapping speech, which better matches common casual conversational content. Therefore, we chose NeMo as the reference framework for further testing and optimization.

In terms of time performance and hardware requirements, both pipelines involve large PyTorch-based deep-learning models, so similar inference times can be expected. Since NeMo is developed by Nvidia, the library is well optimized for CUDA devices, making inference very efficient on large CUDA GPUs (see the next article). Overall, these frameworks are not meant for real-time applications and require fairly large GPUs to perform segmentation efficiently with the default models. To run inference on a CPU or a small GPU, one might consider switching to lighter models, for example a smaller speaker embedding model (TitaNet Small instead of TitaNet Large), or implementing custom models with quantization techniques (see the sketch below).
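
As a hint of what such a switch looks like in NeMo, the speaker embedding model is just another entry in the diarizer configuration. The snippet below assumes that a lighter pre-trained checkpoint (referred to here as 'titanet_small') is available on NGC or locally, which should be verified for your NeMo version:

from omegaconf import OmegaConf

config = OmegaConf.load('../data/param.yaml')

# swap the default speaker embedding model for a lighter one
# (model name assumed to be available as a pre-trained NeMo checkpoint)
config.diarizer.speaker_embeddings.model_path = 'titanet_small'
# a local checkpoint would also work:
# config.diarizer.speaker_embeddings.model_path = '/path/to/titanet_small.nemo'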

Conclusion

To sum up, both pipelines perform well on the simple test case, but their behaviour is highly dependent on their hyperparameters. In particular, the parameters of the VAD model have a large influence on segmentation granularity. Depending on the use case, one might want a small number of larger segments containing full sentences or, conversely, a finer segmentation that keeps only precise linguistic content and discards all small non-speech events (mouth noise, breathing, etc.). Given the results on both test cases, as well as our hands-on experience with both frameworks, NeMo seems more suitable for developing and optimizing voice processing applications.

References

[1] Bredin, H., Yin, R., Coria, J., Gelly, G., Korshunov, P., Lavechin, M., Fustes, D., Titeux, H., Bouaziz, W., & Gill, M.P. (2020). pyannote.audio: neural building blocks for speaker diarization. In ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing.

[2] Oleksii Kuchaiev, Jason Li, Huyen Nguyen, Oleksii Hrinchuk, Ryan Leary, Boris Ginsburg, Samuel Kriman, Stanislav Beliaev, Vitaly Lavrukhin, Jack Cook, Patrice Castonguay, Mariya Popova, Jocelyn Huang, & Jonathan M. Cohen. (2019). NeMo: a toolkit for building AI applications using Neural Modules.

[3] Herve Bredin (2017). pyannote.metrics: a toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems. In Interspeech 2017, 18th Annual Conference of the International Speech Communication Association.

[4] Gaofeng Cheng, Yifan Chen, Runyan Yang, Qingxuan Li, Zehui Yang, Lingxuan Ye, Pengyuan Zhang, Qingqing Zhang, Lei Xie, Yanmin Qian, Kong Aik Lee, & Yonghong Yan. (2022). The Conversational Short-phrase Speaker Diarization (CSSD) Task: Dataset, Evaluation Metric and Baselines.

[5] Nithin Rao Koluguri, Taejin Park, & Boris Ginsburg. (2021). TitaNet: Neural Model for speaker representation with 1D Depth-wise separable convolutions and global context.

[6] Tae Jin Park, Nithin Rao Koluguri, Jagadeesh Balam, & Boris Ginsburg. (2022). Multi-scale Speaker Diarization with Dynamic Scale Weighting.

[7] Brecht Desplanques, Jenthe Thienpondt, & Kris Demuynck (2020). ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. In Interspeech 2020. ISCA.

[8] Mirco Ravanelli, & Yoshua Bengio. (2019). Speaker Recognition from Raw Waveform with SincNet.

[9] Fei Jia, Somshubra Majumdar, & Boris Ginsburg. (2021). MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection.

Acknowledgement

Thanks to our colleagues Alexandre DO, Jean-Bapiste BARDIN, Lu WANG and Edouard LAVAUD for the article review.

About

Jules SINTES is a data scientist at La Javaness.
