Whistle

Whistle is a multilingual and crosslingual ASR model pretrained with weak phonetic supervision using IPA transcriptions generated by LanguageNet G2P models. Unlike self-supervised or grapheme-based approaches, Whistle leverages phoneme-level representations to enable better data efficiency, crosslingual generalization, and reduced catastrophic forgetting. Trained and evaluated on the CommonVoice-based CV-Lang10 benchmark, Whistle demonstrates superior performance on both seen and unseen languages under limited-data conditions.

Whistle was proposed in the paper Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision by Saierdaer Yusuyin et al. from THU-SPMI. The original code repository can be found here.

Model details

Whistle is a Conformer-based encoder model trained with the CTC (Connectionist Temporal Classification) objective. It was trained on ~4k hours of labelled speech data sourced from the publicly available CommonVoice_v11 corpus.
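To make the training objective concrete, here is a minimal sketch of CTC over weak phonetic (IPA) targets using PyTorch's built-in `torch.nn.CTCLoss`. This is not the authors' training code; the encoder stand-in, vocabulary size, feature dimension, and sequence lengths are placeholder assumptions.

```python
import torch
import torch.nn as nn

# Placeholder sizes (illustrative assumptions, not the real Whistle config)
vocab_size = 100            # IPA phoneme inventory + CTC blank
batch, frames, feat_dim = 4, 200, 80
target_len = 30

# Stand-in for the Conformer encoder: any module mapping (B, T, F) -> (B, T, V)
encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, vocab_size))

features = torch.randn(batch, frames, feat_dim)       # e.g. filterbank features
log_probs = encoder(features).log_softmax(dim=-1)     # (B, T, V)

targets = torch.randint(1, vocab_size, (batch, target_len))  # phoneme ids; 0 is the blank
input_lengths = torch.full((batch,), frames, dtype=torch.long)
target_lengths = torch.full((batch,), target_len, dtype=torch.long)

# torch.nn.CTCLoss expects (T, B, V) log-probabilities
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc_loss(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
loss.backward()
print(float(loss))
```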

Whistle checkpoints come in three configurations of varying size: small (90 MB), medium (218 MB), and large (543 MB). Subword-based and wav2vec-based models of small size were also trained for comparison. Each multilingual ASR model is trained on the CV-Lang10 data and then evaluated on the test set of each language without fine-tuning. All of the pre-trained checkpoints are available on the Hugging Face Hub.
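As a minimal sketch, the released weights can be fetched from the Hub with the `huggingface_hub` library. The repository id below is this model's; since the file layout inside the repo is not described on this card, the whole snapshot is downloaded.

```python
# Sketch: fetch a Whistle checkpoint from the Hugging Face Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="thu-spmi/whistle-small-polish")
print("checkpoint files downloaded to:", local_dir)
```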

Evaluation

Results are reported in Phoneme Error Rate (PER%) and Word Error Rate (WER%).
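Both metrics are edit-distance based: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length, with words as tokens for WER and phonemes for PER. A small self-contained sketch (toy strings, not real model output):

```python
# Sketch of how WER/PER are computed: Levenshtein distance between the
# reference and hypothesis token sequences, divided by the reference length.
def error_rate(ref_tokens, hyp_tokens):
    """(S + D + I) / N, with N = len(ref_tokens)."""
    n, m = len(ref_tokens), len(hyp_tokens)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref_tokens[i - 1] == hyp_tokens[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[n][m] / max(n, 1)

# WER on words, PER on phoneme symbols (toy example: one substitution out of three words)
print(error_rate("dzień dobry wszystkim".split(), "dzień dobre wszystkim".split()))
```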

Evaluation on Public CommonVoice_v11

  • %PER

    | Model  | Parameters | en   | es   | fr   | it   | ky   | nl   | ru   | sv-SE | tr   | tt   | Avg. |
    |--------|------------|------|------|------|------|------|------|------|-------|------|------|------|
    | small  | 90 MB      | 8.02 | 3.37 | 5.68 | 4.04 | 8.29 | 5.77 | 6.05 | 18.07 | 8.32 | 8.53 | 7.61 |
    | medium | 218 MB     | 6.70 | 2.63 | 4.53 | 3.12 | 5.95 | 3.95 | 4.61 | 14.81 | 6.04 | 8.47 | 6.08 |
    | large  | 543 MB     | 5.42 | 1.96 | 3.52 | 2.25 | 4.06 | 2.64 | 2.97 | 11.33 | 4.04 | 5.97 | 4.41 |

    | Model                 | Parameters | en | es | fr | it | ky | nl | ru | sv-SE | tr | tt | pl   | Avg. |
    |-----------------------|------------|----|----|----|----|----|----|----|-------|----|----|------|------|
    | small-finetune-polish | 90 MB      | -  | -  | -  | -  | -  | -  | -  | -     | -  | -  | 1.97 | -    |
  • %WER with 4-gram LM (see the decoding sketch after these tables)

    | Model  | Parameters | en    | es   | fr    | it   | ky   | nl   | ru   | sv-SE | tr   | tt   | Avg. |
    |--------|------------|-------|------|-------|------|------|------|------|-------|------|------|------|
    | small  | 90 MB      | 10.76 | 8.68 | 16.01 | 9.98 | 1.02 | 7.32 | 1.59 | 6.14  | 7.63 | 7.30 | 7.64 |
    | medium | 218 MB     | 9.83  | 7.82 | 14.94 | 9.04 | 0.91 | 6.57 | 1.65 | 5.65  | 7.27 | 7.37 | 7.10 |
    | large  | 543 MB     | 8.80  | 7.02 | 14.02 | 8.16 | 0.94 | 6.22 | 1.46 | 5.06  | 7.05 | 6.92 | 6.56 |

    | Model                 | Parameters | en | es | fr | it | ky | nl | ru | sv-SE | tr | tt | pl   | Avg. |
    |-----------------------|------------|----|----|----|----|----|----|----|-------|----|----|------|------|
    | small-finetune-polish | 90 MB      | -  | -  | -  | -  | -  | -  | -  | -     | -  | -  | 4.30 | -    |
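For illustration only, here is one way to run CTC beam-search decoding with an external n-gram LM using the `pyctcdecode` package. This is not the authors' decoding pipeline, and the label list, logits, and LM path below are placeholders.

```python
# Illustrative CTC beam search with an optional KenLM n-gram LM (pyctcdecode).
import numpy as np
from pyctcdecode import build_ctcdecoder

labels = ["", "a", "b", "c", " "]   # placeholder vocabulary; "" is the CTC blank
# Placeholder per-frame log-probabilities of shape (time, vocab); in practice
# these would come from the acoustic model.
logits = np.log(np.full((50, len(labels)), 1.0 / len(labels)))

lm_path = None  # set to e.g. "path/to/4gram.arpa" to rescore with a KenLM 4-gram LM
decoder = build_ctcdecoder(labels, kenlm_model_path=lm_path)
print(decoder.decode(logits))
```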

For more results, please refer to the benchmark.

Training Data

All of our multilingual ASR models are trained on the 10 languages of cv-lang10, which have been processed as described in lang-process. For the English wav2vec-base model and the multilingual wav2vec-base model, only the audio is used for training. The language IDs and training hours of the ten languages are listed in the following table, and a sketch of loading the corresponding CommonVoice subsets follows the table.

| Language | Language ID | # of phonemes | Train hours | Dev hours | Test hours |
|----------|-------------|---------------|-------------|-----------|------------|
| English  | en          | 39            | 2227.3      | 27.2      | 27.0       |
| Spanish  | es          | 32            | 382.3       | 26.0      | 26.5       |
| French   | fr          | 33            | 823.4       | 25.0      | 25.4       |
| Italian  | it          | 30            | 271.5       | 24.7      | 26.0       |
| Kirghiz  | ky          | 32            | 32.7        | 2.1       | 2.2        |
| Dutch    | nl          | 39            | 70.2        | 13.8      | 13.9       |
| Russian  | ru          | 32            | 149.8       | 14.6      | 15.0       |
| Swedish  | sv-SE       | 33            | 29.8        | 5.5       | 6.2        |
| Turkish  | tr          | 41            | 61.5        | 10.1      | 11.4       |
| Tatar    | tt          | 31            | 20.8        | 3.0       | 5.7        |
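As a hedged sketch of obtaining the raw audio and text above, the gated `mozilla-foundation/common_voice_11_0` dataset can be pulled with the Hugging Face `datasets` library; you must accept the dataset terms on the Hub and be logged in, and because the dataset ships a loading script, an older `datasets` release that still supports scripts may be required. This is not the lang-process pipeline itself.

```python
# Sketch: load one CommonVoice v11 subset; the language codes follow the table above.
from datasets import load_dataset

cv_test = load_dataset(
    "mozilla-foundation/common_voice_11_0",
    "ky",           # language id, e.g. "ky", "sv-SE", "tt", ...
    split="test",
)
print(cv_test)
```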

BibTeX entry and citation info

@article{yusuyin2025whistle,
  title={Whistle: Data-efficient multilingual and crosslingual speech recognition via weakly phonetic supervision},
  author={Yusuyin, Saierdaer and Ma, Te and Huang, Hao and Zhao, Wenbo and Ou, Zhijian},
  journal={IEEE Transactions on Audio, Speech and Language Processing},
  year={2025},
  publisher={IEEE}
}

Community

If you encounter problems when using the model, you can raise an issue on the GitHub page.
