Kyrgyz Whisper Medium (merged)

This repository provides merged model weights for Kyrgyz automatic speech recognition (ASR). The model was created by LoRA fine-tuning Whisper Medium and then merging the adapter into the base model.

What does “merged” mean?

During training, I fine-tuned a LoRA adapter with PEFT and then used merge_and_unload() to bake the adapter weights into the base model. This repo therefore contains a standalone Transformers model; PEFT is not needed for inference.
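
For reference, a minimal sketch of the merge step. It assumes the base model is openai/whisper-medium, and the adapter repo id shown is a placeholder:

from peft import PeftModel
from transformers import AutoModelForSpeechSeq2Seq

# Load the base model, then attach the LoRA adapter on top of it.
# "AleksTv/whisper-medium-ky-lora" is a hypothetical adapter id.
base = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-medium")
peft_model = PeftModel.from_pretrained(base, "AleksTv/whisper-medium-ky-lora")

# Bake the adapter deltas into the base weights and drop the PEFT wrappers,
# leaving a plain Transformers model that can be saved on its own.
# (Processor/tokenizer files are saved separately.)
merged = peft_model.merge_and_unload()
merged.save_pretrained("whisper-medium-ky-merged")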

If you want the lightweight adapter-only version, see the separate adapter-only repository.

Dataset

  • Training/evaluation dataset: fsicoli/common_voice_22_0 (config: ky)
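
A minimal sketch for loading the same config with the datasets library (depending on the mirror, you may need to be logged in to the Hub):

from datasets import Audio, load_dataset

# Kyrgyz config of the Common Voice 22.0 mirror used for training/evaluation.
cv_ky = load_dataset("fsicoli/common_voice_22_0", "ky")

# Whisper expects 16 kHz audio, so cast the audio column accordingly.
cv_ky = cv_ky.cast_column("audio", Audio(sampling_rate=16_000))

print(cv_ky)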

Results

Evaluation on the Common Voice 22.0 Kyrgyz test split (WER values are percentages):

  • WER (normalized): 16.2061
  • WER_ortho (orthographic): 19.1491
  • test_loss: 0.1722

Quick check (200 random test samples):

  • WER: 16.1677
  • WER_ortho: 19.6021
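
Orthographic WER scores the raw transcripts, so casing and punctuation count as errors; normalized WER applies a text normalizer to both sides first. Below is a sketch of one common way to compute both with the evaluate library; the use of Whisper's BasicTextNormalizer is an assumption, and the predictions/references lists are placeholders:

import evaluate
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

wer_metric = evaluate.load("wer")
normalizer = BasicTextNormalizer()  # language-agnostic normalizer for non-English text

predictions = ["..."]  # model transcripts (placeholder)
references = ["..."]   # ground-truth transcripts (placeholder)

# Orthographic WER: compare raw strings; reported as a percentage.
wer_ortho = 100 * wer_metric.compute(predictions=predictions, references=references)

# Normalized WER: strip casing/punctuation before scoring.
wer = 100 * wer_metric.compute(
    predictions=[normalizer(p) for p in predictions],
    references=[normalizer(r) for r in references],
)

print(f"WER_ortho: {wer_ortho:.4f}  WER: {wer:.4f}")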

Training details

LoRA fine-tuning summary:

  • LoRA: r=8, lora_alpha=16, lora_dropout=0.1
  • Target modules: q_proj, v_proj
  • Steps: max_steps=4000
  • Best checkpoint by WER: checkpoint-4000 (WER=16.21)
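
A sketch of the corresponding PEFT setup; the hyperparameters come from the list above, everything else is a library default or an assumption:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSpeechSeq2Seq

base = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-medium")

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the update
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # attention query/value projections
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable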

Training progress (selected checkpoints):

Step   Train loss   Val loss   WER_ortho   WER
500    0.7980       0.7911     44.3501     42.0754
1000   0.3980       0.2043     28.9947     27.8551
1500   0.1712       0.1821     20.7479     17.7343
2000   0.1734       0.1770     20.7569     17.6977
2500   0.1935       0.1743     19.7995     16.8192
3000   0.3406       0.1728     19.8988     16.9656
3500   0.3192       0.1724     19.3840     16.4074
4000   0.1499       0.1722     19.1491     16.2061

How to use

Install

pip install -U "transformers" "accelerate" "torch"

Inference (Transformers pipeline)

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_id = "AleksTv/whisper-medium-ky-merged"

device = 0 if torch.cuda.is_available() else -1
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)

# Standard Whisper processor/tokenizer files are included in this repo.
# No remote custom Python code is required.
processor = AutoProcessor.from_pretrained(model_id)

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=device,
)

print(asr("path/to/audio.wav")["text"])
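
Equivalently, without the pipeline, reusing model and processor from above. This is a sketch; librosa is just one way (an assumption) to load 16 kHz mono audio:

import librosa

# Load and resample to the 16 kHz mono input Whisper expects.
audio, sr = librosa.load("path/to/audio.wav", sr=16_000, mono=True)

# The feature extractor pads/truncates to Whisper's 30 s input window.
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
input_features = inputs.input_features.to(model.device, dtype=model.dtype)

generated_ids = model.generate(input_features)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])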

Tips

  • For long audio, quality usually improves with VAD/segmentation and stitching; a chunked-inference sketch follows this list.
  • Prefer 16 kHz mono WAV (or rely on the pipeline to resample).
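
The simplest built-in option is the pipeline's chunked inference, which splits the audio into overlapping windows, transcribes them independently, and stitches the results. This sketch reuses model, processor, and device from the example above; the chunk and stride lengths are reasonable starting points, not tuned values:

asr_long = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=device,
    chunk_length_s=30,       # window length in seconds
    stride_length_s=(5, 5),  # left/right overlap used for stitching
)

print(asr_long("path/to/long_audio.wav")["text"])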

Limitations

  • Performance may degrade on very noisy audio, overlapping speech, and long recordings without segmentation.
  • ASR models may occasionally hallucinate text on difficult segments.

License

Apache-2.0.

Citation

If you use this model, please cite Whisper:

@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022}
}