VoxCeleb Dataset

Dataset Description

VoxCeleb is a collection of voice recordings of celebrities extracted from various YouTube videos. It contains:

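+------------+------------+------------+--------------+
|                         | Identities | Sample count |
+============+============+============+==============+
| train                   | 1211       | 148642       |
+------------+------------+------------+--------------+
| dev / eval | references | 40         | 4874         |
+            +------------+            +--------------+
|            | probes     |            | 37720        |
+------------+------------+------------+--------------+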

For this protocol, the dev and eval sets are copies of each other, so the results below are reported for the development set only.

GMM

To run the baseline, use the following command:

bob bio pipeline simple -d voxceleb gmm-mobio -l sge-demanding -o results/gmm_voxceleb -n 512

Then, to compute the metrics from the resulting scores, use:

bob bio metrics -e ./results/gmm_voxceleb/scores-dev.csv
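Table 14 [Min. criterion: EER] Threshold on Development set: 1.062216e-01

+-----------------------+--------------------+
|                       | Development        |
+=======================+====================+
| Failure to Acquire    | 0.0%               |
+-----------------------+--------------------+
| False Match Rate      | 18.8% (3538/18860) |
+-----------------------+--------------------+
| False Non Match Rate  | 18.8% (3538/18860) |
+-----------------------+--------------------+
| False Accept Rate     | 18.8%              |
+-----------------------+--------------------+
| False Reject Rate     | 18.8%              |
+-----------------------+--------------------+
| Half Total Error Rate | 18.8%              |
+-----------------------+--------------------+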

Ran in 10 hours on 128[1] CPU nodes of the SGE Grid.
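The "Min. criterion: EER" threshold reported above is the score at which the False Match Rate and the False Non Match Rate are (as nearly as possible) equal. The following is a minimal pure-Python sketch of that idea, not the implementation used by bob. It assumes that a comparison is accepted when its score is greater than or equal to the threshold; the function names and the toy scores are made up for illustration.

```python
def error_rates(impostor, genuine, threshold):
    """Return (FMR, FNMR), assuming scores >= threshold are accepted."""
    fmr = sum(s >= threshold for s in impostor) / len(impostor)  # impostors accepted
    fnmr = sum(s < threshold for s in genuine) / len(genuine)    # genuine rejected
    return fmr, fnmr


def eer_threshold(impostor, genuine):
    """Pick the observed score where |FMR - FNMR| is smallest."""
    candidates = sorted(set(impostor) | set(genuine))

    def gap(t):
        fmr, fnmr = error_rates(impostor, genuine, t)
        return abs(fmr - fnmr)

    return min(candidates, key=gap)


# Toy scores, invented for this example (not VoxCeleb results).
impostor = [-2.0, -1.5, -1.0, -0.5, 0.5]
genuine = [-0.8, 0.0, 0.7, 1.2, 2.0]

threshold = eer_threshold(impostor, genuine)
fmr, fnmr = error_rates(impostor, genuine, threshold)
hter = (fmr + fnmr) / 2  # Half Total Error Rate

print(threshold, fmr, fnmr, hter)  # 0.0 0.2 0.2 0.2
```

At the EER threshold both error rates coincide, which is why the False Match Rate, False Non Match Rate and Half Total Error Rate share the same value in the table.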

ISV

TODO

SpeechBrain ECAPA-TDNN

This baseline reproduces the speaker verification experiment with a pretrained ECAPA-TDNN model using the SpeechBrain library. The original paper is cited as follows:

@inproceedings{spear,
  author = {Brecht Desplanques and Jenthe Thienpondt and Kris Demuynck},
  title = {{ECAPA-TDNN:} Emphasized Channel Attention, Propagation and Aggregation in {TDNN} Based Speaker Verification},
  booktitle = {Interspeech 2020},
  year = {2020},
  url = {https://www.isca-speech.org/archive_v0/Interspeech_2020/pdfs/2650.pdf},
}

To run the baseline, use the following command:

bob bio pipeline simple -vvv -d voxceleb -p speechbrain-ecapa-voxceleb -g dev -o ./results/speechbrain_voxceleb

Then, to compute the metrics from the resulting scores, use:

bob bio metrics -e ./results/speechbrain_voxceleb/scores-dev.csv
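Table 15 [Min. criterion: EER] Threshold on Development set: -6.159925e-01

+-----------------------+--------------------+
|                       | Development        |
+=======================+====================+
| Failure to Acquire    | 0.0%               |
+-----------------------+--------------------+
| False Match Rate      | 1.0% (189/18860)   |
+-----------------------+--------------------+
| False Non Match Rate  | 1.0% (189/18860)   |
+-----------------------+--------------------+
| False Accept Rate     | 1.0%               |
+-----------------------+--------------------+
| False Reject Rate     | 1.0%               |
+-----------------------+--------------------+
| Half Total Error Rate | 1.0%               |
+-----------------------+--------------------+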

Ran in 9 minutes on 128[1] CPU nodes of the SGE Grid (no training).
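The percentages in Table 14 and Table 15 follow directly from the error counts shown in parentheses. As a quick arithmetic check, the sketch below recomputes them from those counts; the helper name rate_percent is hypothetical and only used here.

```python
def rate_percent(errors, trials):
    """Error rate as a percentage of the number of trials."""
    return 100.0 * errors / trials


# Counts copied from Table 14 (GMM) and Table 15 (ECAPA-TDNN).
gmm_fmr = rate_percent(3538, 18860)
gmm_fnmr = rate_percent(3538, 18860)
gmm_hter = (gmm_fmr + gmm_fnmr) / 2  # Half Total Error Rate

ecapa_fmr = rate_percent(189, 18860)
ecapa_fnmr = rate_percent(189, 18860)
ecapa_hter = (ecapa_fmr + ecapa_fnmr) / 2

print(f"GMM:        FMR {gmm_fmr:.1f}%  FNMR {gmm_fnmr:.1f}%  HTER {gmm_hter:.1f}%")
print(f"ECAPA-TDNN: FMR {ecapa_fmr:.1f}%  FNMR {ecapa_fnmr:.1f}%  HTER {ecapa_hter:.1f}%")
# GMM:        FMR 18.8%  FNMR 18.8%  HTER 18.8%
# ECAPA-TDNN: FMR 1.0%  FNMR 1.0%  HTER 1.0%
```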

Note

The reference result for ECAPA-TDNN is 0.8% EER on VoxCeleb. However, that result was obtained on a customized version of the dataset (VoxCeleb (cleaned)), which omits 109 probe files (presumably containing wrong data) that are present in our dataset.

Footnotes