# Kyrgyz Whisper Medium (merged)
This repository provides merged model weights for Kyrgyz ASR. The model was created by LoRA fine-tuning and then merging the adapter into the base model.
## Links
- Base model: https://huggingface.co/nineninesix/kyrgyz-whisper-medium
- Whisper paper: https://arxiv.org/abs/2212.04356
- Whisper Medium (architecture reference): https://huggingface.co/openai/whisper-medium
## What does “merged” mean?
During training, I fine-tuned a LoRA adapter (PEFT) and then used `merge_and_unload()` to bake the adapter weights into the base model. This repo contains the resulting standalone Transformers model (no PEFT needed for inference).
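For reference, the merge step looks roughly like the sketch below. The adapter repository id is a placeholder (the adapter link is not included on this card), and `peft` must be installed:

```python
from peft import PeftModel
from transformers import AutoModelForSpeechSeq2Seq

# Load the base model, then attach the trained LoRA adapter.
base = AutoModelForSpeechSeq2Seq.from_pretrained("nineninesix/kyrgyz-whisper-medium")
peft_model = PeftModel.from_pretrained(base, "ADAPTER_REPO_ID")  # placeholder id

# Bake the LoRA deltas into the base weights and drop the PEFT wrapper.
merged = peft_model.merge_and_unload()
merged.save_pretrained("whisper-medium-ky-merged")
```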
If you want the lightweight adapter-only version, see:
## Dataset
- Training/evaluation dataset: `fsicoli/common_voice_22_0` (config: `ky`); a loading sketch follows below.
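A minimal sketch for loading the test split with the `datasets` library, assuming the standard Common Voice column names (`audio`, `sentence`):

```python
from datasets import Audio, load_dataset

# Kyrgyz config of the Common Voice 22.0 mirror used for training/evaluation.
ds = load_dataset("fsicoli/common_voice_22_0", "ky", split="test")

# Whisper expects 16 kHz input; resample lazily on access.
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
print(ds[0]["sentence"])
```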
## Results
Evaluation on Common Voice 22.0 Kyrgyz (test split):
- WER (normalized): 16.2061
- WER_ortho (orthographic): 19.1491
- test_loss: 0.1722
Quick check (200 random test samples):
- WER: 16.1677
- WER_ortho: 19.6021
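For context, the two WER variants can be computed as in the sketch below (requires `evaluate` and `jiwer`; the assumption here is that the normalized score applies Whisper's `BasicTextNormalizer` before scoring):

```python
import evaluate
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

wer_metric = evaluate.load("wer")
normalizer = BasicTextNormalizer()

predictions = ["саламатсызбы"]   # toy model output
references = ["Саламатсызбы!"]   # toy reference transcript

# Orthographic WER: compare raw strings, punctuation and casing included.
wer_ortho = 100 * wer_metric.compute(predictions=predictions, references=references)

# Normalized WER: lowercase and strip punctuation before scoring.
wer = 100 * wer_metric.compute(
    predictions=[normalizer(p) for p in predictions],
    references=[normalizer(r) for r in references],
)
print(f"WER_ortho={wer_ortho:.2f}, WER={wer:.2f}")
```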
## Training details
LoRA fine-tuning summary (a configuration sketch follows after the list):

- LoRA: `r=8`, `lora_alpha=16`, `lora_dropout=0.1`
- Target modules: `q_proj`, `v_proj`
- Steps: `max_steps=4000`
- Best checkpoint by WER: `checkpoint-4000` (WER = 16.21)
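Expressed as a PEFT configuration, the setup above corresponds roughly to the following sketch (other hyperparameters such as learning rate and batch size are not listed on this card):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSpeechSeq2Seq

base = AutoModelForSpeechSeq2Seq.from_pretrained("nineninesix/kyrgyz-whisper-medium")

lora_config = LoraConfig(
    r=8,                                  # LoRA rank
    lora_alpha=16,                        # scaling factor
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # attention query/value projections
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()        # only the adapter weights train
```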
Training progress (selected checkpoints):
| Step | Train loss | Val loss | WER_ortho | WER |
|---|---|---|---|---|
| 500 | 0.7980 | 0.7911 | 44.3501 | 42.0754 |
| 1000 | 0.3980 | 0.2043 | 28.9947 | 27.8551 |
| 1500 | 0.1712 | 0.1821 | 20.7479 | 17.7343 |
| 2000 | 0.1734 | 0.1770 | 20.7569 | 17.6977 |
| 2500 | 0.1935 | 0.1743 | 19.7995 | 16.8192 |
| 3000 | 0.3406 | 0.1728 | 19.8988 | 16.9656 |
| 3500 | 0.3192 | 0.1724 | 19.3840 | 16.4074 |
| 4000 | 0.1499 | 0.1722 | 19.1491 | 16.2061 |
## How to use
### Install

```bash
pip install -U "transformers" "accelerate" "torch"
```
### Inference (Transformers pipeline)

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_id = "AleksTv/whisper-medium-ky-merged"
device = 0 if torch.cuda.is_available() else -1
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)

# Standard Whisper processor/tokenizer files are included in this repo.
# No remote custom Python code is required.
processor = AutoProcessor.from_pretrained(model_id)

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=device,
)

print(asr("path/to/audio.wav")["text"])
```
## Tips

- For long audio, quality usually improves with VAD/segmentation + stitching; a simpler built-in chunking sketch follows after this list.
- Prefer 16 kHz mono WAV (or rely on the pipeline to resample).
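As a lighter-weight alternative to external VAD, the pipeline can window long audio itself. This sketch reuses `model`, `processor`, and `device` from the inference example above; `chunk_length_s` and `batch_size` are standard pipeline arguments, and the values shown are just reasonable starting points, not tuned settings:

```python
# Chunked long-form inference: 30 s windows decoded in batches and stitched.
asr_long = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=8,
    device=device,
)
print(asr_long("path/to/long_audio.wav")["text"])
```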
## Limitations
- Performance may degrade on very noisy audio, overlapping speech, and long recordings without segmentation.
- ASR models may occasionally hallucinate text on difficult segments.
## License
Apache-2.0.
## Citation
If you use this model, please cite Whisper:
```bibtex
@misc{radford2022whisper,
  doi       = {10.48550/ARXIV.2212.04356},
  url       = {https://arxiv.org/abs/2212.04356},
  author    = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title     = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year      = {2022}
}
```