Whistle

Whistle is a multilingual and crosslingual ASR model pretrained with weak phonetic supervision using IPA transcriptions generated by LanguageNet G2P models. Unlike self-supervised or grapheme-based approaches, Whistle leverages phoneme-level representations to enable better data efficiency, crosslingual generalization, and reduced catastrophic forgetting. Trained and evaluated on the CommonVoice-based CV-Lang10 benchmark, Whistle demonstrates superior performance on both seen and unseen languages under limited-data conditions.

Whistle was proposed in the paper Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision by Saierdaer Yusuyin et al. from THU-SPMI. The original code repository can be found here.

Model details

Whistle is a Conformer-based encoder model trained with the CTC (Connectionist Temporal Classification) objective. It was trained on ~4k hours of labelled speech data sourced from the publicly available CommonVoice_v11 corpus.
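To make the training objective concrete, here is a minimal sketch of CTC over weak phonetic (IPA) targets using PyTorch's built-in `torch.nn.CTCLoss`. This is not the authors' training code; the encoder stand-in, vocabulary size, feature dimension, and sequence lengths are placeholder assumptions.

```python
import torch
import torch.nn as nn

# Placeholder sizes (illustrative assumptions, not the real Whistle config)
vocab_size = 100            # IPA phoneme inventory + CTC blank
batch, frames, feat_dim = 4, 200, 80
target_len = 30

# Stand-in for the Conformer encoder: any module mapping (B, T, F) -> (B, T, V)
encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, vocab_size))

features = torch.randn(batch, frames, feat_dim)       # e.g. filterbank features
log_probs = encoder(features).log_softmax(dim=-1)     # (B, T, V)

targets = torch.randint(1, vocab_size, (batch, target_len))  # phoneme ids; 0 is the blank
input_lengths = torch.full((batch,), frames, dtype=torch.long)
target_lengths = torch.full((batch,), target_len, dtype=torch.long)

# torch.nn.CTCLoss expects (T, B, V) log-probabilities
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc_loss(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
loss.backward()
print(float(loss))
```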

Whistle checkpoints come in three configurations of varying size: small (90 MB), medium (218 MB), and large (543 MB). Subword-based and wav2vec-based models of small size were also trained for comparison. Each multilingual ASR model is trained on the CV-Lang10 data and then evaluated on the test set of each language without fine-tuning. All of the pre-trained checkpoints are available on the Hugging Face Hub.
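As a minimal sketch, the released weights can be fetched from the Hub with the `huggingface_hub` library. The repository id below is this model's; since the file layout inside the repo is not described on this card, the whole snapshot is downloaded.

```python
# Sketch: fetch a Whistle checkpoint from the Hugging Face Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="thu-spmi/whistle-small-polish")
print("checkpoint files downloaded to:", local_dir)
```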

Evaluation

Results are reported in Phoneme Error Rate (PER%) and Word Error Rate (WER%).
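Both metrics are edit-distance based: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length, with words as tokens for WER and phonemes for PER. A small self-contained sketch (toy strings, not real model output):

```python
# Sketch of how WER/PER are computed: Levenshtein distance between the
# reference and hypothesis token sequences, divided by the reference length.
def error_rate(ref_tokens, hyp_tokens):
    """(S + D + I) / N, with N = len(ref_tokens)."""
    n, m = len(ref_tokens), len(hyp_tokens)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref_tokens[i - 1] == hyp_tokens[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[n][m] / max(n, 1)

# WER on words, PER on phoneme symbols (toy example: one substitution out of three words)
print(error_rate("dzień dobry wszystkim".split(), "dzień dobre wszystkim".split()))
```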

Evaluation on Public CommonVoice_v11

  • %PER

    | Model  | Parameters | en   | es   | fr   | it   | ky   | nl   | ru   | sv-SE | tr   | tt   | Avg. |
    |--------|------------|------|------|------|------|------|------|------|-------|------|------|------|
    | small  | 90 MB      | 8.02 | 3.37 | 5.68 | 4.04 | 8.29 | 5.77 | 6.05 | 18.07 | 8.32 | 8.53 | 7.61 |
    | medium | 218 MB     | 6.70 | 2.63 | 4.53 | 3.12 | 5.95 | 3.95 | 4.61 | 14.81 | 6.04 | 8.47 | 6.08 |
    | large  | 543 MB     | 5.42 | 1.96 | 3.52 | 2.25 | 4.06 | 2.64 | 2.97 | 11.33 | 4.04 | 5.97 | 4.41 |

    | Model                 | Parameters | en | es | fr | it | ky | nl | ru | sv-SE | tr | tt | pl   | Avg. |
    |-----------------------|------------|----|----|----|----|----|----|----|-------|----|----|------|------|
    | small-finetune-polish | 90 MB      | -  | -  | -  | -  | -  | -  | -  | -     | -  | -  | 1.97 | -    |
  • %WER with 4-gram LM (see the decoding sketch after these tables)

    | Model  | Parameters | en    | es   | fr    | it   | ky   | nl   | ru   | sv-SE | tr   | tt   | Avg. |
    |--------|------------|-------|------|-------|------|------|------|------|-------|------|------|------|
    | small  | 90 MB      | 10.76 | 8.68 | 16.01 | 9.98 | 1.02 | 7.32 | 1.59 | 6.14  | 7.63 | 7.30 | 7.64 |
    | medium | 218 MB     | 9.83  | 7.82 | 14.94 | 9.04 | 0.91 | 6.57 | 1.65 | 5.65  | 7.27 | 7.37 | 7.10 |
    | large  | 543 MB     | 8.80  | 7.02 | 14.02 | 8.16 | 0.94 | 6.22 | 1.46 | 5.06  | 7.05 | 6.92 | 6.56 |

    | Model                 | Parameters | en | es | fr | it | ky | nl | ru | sv-SE | tr | tt | pl   | Avg. |
    |-----------------------|------------|----|----|----|----|----|----|----|-------|----|----|------|------|
    | small-finetune-polish | 90 MB      | -  | -  | -  | -  | -  | -  | -  | -     | -  | -  | 4.30 | -    |
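For illustration only, here is one way to run CTC beam-search decoding with an external n-gram LM using the `pyctcdecode` package. This is not the authors' decoding pipeline, and the label list, logits, and LM path below are placeholders.

```python
# Illustrative CTC beam search with an optional KenLM n-gram LM (pyctcdecode).
import numpy as np
from pyctcdecode import build_ctcdecoder

labels = ["", "a", "b", "c", " "]   # placeholder vocabulary; "" is the CTC blank
# Placeholder per-frame log-probabilities of shape (time, vocab); in practice
# these would come from the acoustic model.
logits = np.log(np.full((50, len(labels)), 1.0 / len(labels)))

lm_path = None  # set to e.g. "path/to/4gram.arpa" to rescore with a KenLM 4-gram LM
decoder = build_ctcdecoder(labels, kenlm_model_path=lm_path)
print(decoder.decode(logits))
```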

For more results, please refer to the benchmark.

Training Data

All of our multilingual ASR models are trained on the 10 languages of cv-lang10, which have been processed as described in lang-process. For the English wav2vec-base model and the multilingual wav2vec-base model, only the audio is used for training. The language IDs and training hours of the ten languages are listed in the following table, and a sketch of loading the corresponding CommonVoice subsets follows the table.

| Language | Language ID | # of phonemes | Train hours | Dev hours | Test hours |
|----------|-------------|---------------|-------------|-----------|------------|
| English  | en          | 39            | 2227.3      | 27.2      | 27.0       |
| Spanish  | es          | 32            | 382.3       | 26.0      | 26.5       |
| French   | fr          | 33            | 823.4       | 25.0      | 25.4       |
| Italian  | it          | 30            | 271.5       | 24.7      | 26.0       |
| Kirghiz  | ky          | 32            | 32.7        | 2.1       | 2.2        |
| Dutch    | nl          | 39            | 70.2        | 13.8      | 13.9       |
| Russian  | ru          | 32            | 149.8       | 14.6      | 15.0       |
| Swedish  | sv-SE       | 33            | 29.8        | 5.5       | 6.2        |
| Turkish  | tr          | 41            | 61.5        | 10.1      | 11.4       |
| Tatar    | tt          | 31            | 20.8        | 3.0       | 5.7        |
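As a hedged sketch of obtaining the raw audio and text above, the gated `mozilla-foundation/common_voice_11_0` dataset can be pulled with the Hugging Face `datasets` library; you must accept the dataset terms on the Hub and be logged in, and because the dataset ships a loading script, an older `datasets` release that still supports scripts may be required. This is not the lang-process pipeline itself.

```python
# Sketch: load one CommonVoice v11 subset; the language codes follow the table above.
from datasets import load_dataset

cv_test = load_dataset(
    "mozilla-foundation/common_voice_11_0",
    "ky",           # language id, e.g. "ky", "sv-SE", "tt", ...
    split="test",
)
print(cv_test)
```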

BibTeX entry and citation info

@article{yusuyin2025whistle,
  title={Whistle: Data-efficient multilingual and crosslingual speech recognition via weakly phonetic supervision},
  author={Yusuyin, Saierdaer and Ma, Te and Huang, Hao and Zhao, Wenbo and Ou, Zhijian},
  journal={IEEE Transactions on Audio, Speech and Language Processing},
  year={2025},
  publisher={IEEE}
}

Community

If you encounter problems when using the model, you can raise an issue on the GitHub page.
