This is a ResNet-152 speaker recognition model trained on a combined version of the VoxBlink2 dataset, which contains 111,284 speakers.

This model is designed to generalize well across multiple domains, performing robustly on both telephone speech and cleaner, wideband audio (e.g., VoxCeleb).

To achieve this, the training data consists of two parts:

  1. The original 16 kHz data from VoxBlink2.
  2. An augmented, narrowband version of the same data: downsampled to 8 kHz, with the GSM codec applied to 50% of the utterances, then upsampled back to 16 kHz.
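The narrowband augmentation above can be sketched as follows. This is a minimal illustration, assuming SciPy for polyphase resampling; the GSM codec step is left as a pluggable hook (`codec`), since in practice it is usually applied with an external tool such as SoX or FFmpeg. The function name and signature are hypothetical, not part of the released training code.

```python
import numpy as np
from scipy.signal import resample_poly


def narrowband_augment(wav_16k, rng, codec=None):
    """Simulate telephone-channel audio from 16 kHz input.

    Downsample to 8 kHz, optionally pass through a codec
    (e.g. GSM, applied to ~50% of utterances), upsample back.
    """
    wav_8k = resample_poly(wav_16k, 1, 2)            # 16 kHz -> 8 kHz
    if codec is not None and rng.random() < 0.5:     # codec on 50% of the data
        wav_8k = codec(wav_8k)                       # external GSM encode/decode
    return resample_poly(wav_8k, 2, 1)               # 8 kHz -> 16 kHz
```

The round trip through 8 kHz discards content above 4 kHz, which is what makes the augmented copy match telephone bandwidth.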

The backbone was trained using the WeSpeaker toolkit, following their standard VoxCeleb recipe.
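Models trained with WeSpeaker-style recipes produce a fixed-dimensional embedding per utterance, and verification trials are typically scored with cosine similarity between the two embeddings. A minimal sketch, assuming NumPy arrays as embeddings (the function name is illustrative, not part of the toolkit's API):

```python
import numpy as np


def cosine_score(emb1, emb2):
    """Cosine similarity between two speaker embeddings.

    Higher scores indicate the two utterances are more likely
    to come from the same speaker.
    """
    e1 = emb1 / np.linalg.norm(emb1)
    e2 = emb2 / np.linalg.norm(emb2)
    return float(np.dot(e1, e2))
```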

Results on SRE-24

|             | EER (%) | min Cprimary |
|-------------|---------|--------------|
| Development | 8.99    | 0.541        |
| Evaluation  | 7.53    | 0.582        |

Results on VoxCeleb1

|             | EER (%) |
|-------------|---------|
| VoxCeleb1-O | 1.65    |
| VoxCeleb1-E | 1.37    |
| VoxCeleb1-H | 2.73    |
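The EER reported above is the operating point where the false-rejection rate equals the false-acceptance rate. A minimal NumPy sketch of how it can be computed from target and non-target trial scores (this is an illustrative implementation, not the official NIST scoring tool):

```python
import numpy as np


def eer(target_scores, nontarget_scores):
    """Equal Error Rate: point where false-reject rate == false-accept rate."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate(
        [np.ones(len(target_scores)), np.zeros(len(nontarget_scores))]
    )
    order = np.argsort(scores)          # sweep thresholds in ascending order
    labels = labels[order]
    # After rejecting the i lowest-scoring trials:
    fnr = np.cumsum(labels) / labels.sum()              # targets rejected
    fpr = 1 - np.cumsum(1 - labels) / (1 - labels).sum()  # nontargets accepted
    idx = np.argmin(np.abs(fnr - fpr))
    return (fnr[idx] + fpr[idx]) / 2
```

min Cprimary is the SRE detection cost, a weighted combination of miss and false-alarm rates defined in the NIST SRE evaluation plan.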

Citation

If you use this model in your research, please cite the following paper:

@inproceedings{barahona25_interspeech,
  title     = {{Analysis of ABC Frontend Audio Systems for the NIST-SRE24}},
  author    = {Sara Barahona and Anna Silnova and Ladislav Mošner and Junyi Peng and Oldřich Plchot and Johan Rohdin and Lin Zhang and Jiangyu Han and Petr Pálka and Federico Landini and Lukáš Burget and Themos Stafylakis and Sandro Cumani and Dominik Boboš and Miroslav Hlaváček and Martin Kodovský and Tomáš Pavlíček},
  year      = {2025},
  booktitle = {{Interspeech 2025}},
  pages     = {5763--5767},
  doi       = {10.21437/Interspeech.2025-2737},
  issn      = {2958-1796},
}