asr-whisper-helpline-sw-v1
This model is a fine-tuned version of openai/whisper-large-v2 on the Common Voice Swahili dataset.
Model Description
This ASR model is specifically fine-tuned for Swahili speech recognition in the context of the Tanzania Child Helpline, powered by OpenCHS (Open Source Child Helpline System). The model is designed to transcribe Swahili spoken in Tanzanian call center environments.
Performance Highlights:
- Best Validation WER: 23.56% (achieved at step 7,500 of continued training)
- Baseline WER: 89.05% (Whisper Large v2 zero-shot on Common Voice 17.0)
- Improvement: 65.5 percentage point reduction in WER (73.5% relative error rate reduction)
This represents a significant improvement over the base Whisper Large v2 model for Swahili transcription tasks.
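The two improvement figures follow directly from the reported WERs. A minimal sketch of the arithmetic, using only the values quoted above (the variable names are illustrative):

```python
# WER values reported in this card (percent).
baseline_wer = 89.05  # Whisper Large v2 zero-shot, Common Voice 17.0 validation
best_wer = 23.56      # best Stage 2 checkpoint, Common Voice 23.0 validation

# Absolute reduction, in percentage points.
absolute_pp = baseline_wer - best_wer

# Relative reduction: the share of baseline errors that were removed.
relative_pct = 100.0 * absolute_pp / baseline_wer

print(f"{absolute_pp:.2f} pp absolute, {relative_pct:.1f}% relative")
```

Note that the baseline and the best checkpoint are scored on different Common Voice versions, so the figures are indicative rather than a strict like-for-like comparison.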
Training Strategy
The model was trained in two stages:
- Stage 1 - Common Voice 17.0: Initial fine-tuning on Common Voice 17.0 Swahili dataset (10,000 steps)
- Stage 2 - Common Voice 23.0: Continued fine-tuning on Common Voice 23.0 Swahili dataset (7,500 steps)
Total Training: 17,500 effective steps with the best checkpoint selected at step 7,500 of stage 2 based on lowest validation WER.
Intended Uses & Limitations
Intended Uses
- Primary: Transcribing Swahili speech in call center environments, specifically for child helpline services in Tanzania
- General: Swahili automatic speech recognition tasks
- Research: Baseline for domain adaptation studies (general speech → telephony/call center audio)
Limitations
- Domain Shift: Model is trained on Common Voice (clean, read speech) but intended for call center audio. Performance on actual telephony audio may differ and requires validation.
- Language Variety: Training data may not fully represent all Tanzanian Swahili dialects and speaking styles.
- Audio Quality: Performance may degrade with low-quality audio, background noise, or poor recording conditions typical in telephony.
- Code-Switching: May not handle code-switching between Swahili and English/other languages well.
Known Issues
- Domain-specific evaluation on actual call center audio is pending
Training and Evaluation Data
Stage 1: Common Voice 17.0 (Swahili)
Training Configuration:
- Training samples: Streamed entire Common Voice 17.0 Swahili training split
- Validation samples: 2,000 samples
- Source: Mozilla Common Voice 17.0
- Language: Swahili (sw)
- Data type: Read speech from diverse speakers
- Streaming mode: Used dataset streaming to minimize disk usage
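As a rough illustration of the streaming setup described above, the sketch below opens a split lazily and takes a fixed-size subset such as the 2,000 validation samples. The repository id, config name, and the `trust_remote_code` flag are assumptions based on how Common Voice is usually published on the Hugging Face Hub and on Datasets 2.21.0; they are not confirmed by this card, and both helper names are hypothetical.

```python
from itertools import islice


def take_samples(stream, n):
    """Return the first n items of any iterable without reading the rest."""
    return list(islice(stream, n))


def load_cv17_swahili_stream(split="train"):
    """Open a Common Voice 17.0 Swahili split in streaming mode.

    Needs network access; the dataset is gated, so Hugging Face
    authentication may be required. Repository id and config name
    are assumptions, not taken from the card.
    """
    # Imported lazily so take_samples stays dependency-free.
    from datasets import load_dataset

    return load_dataset(
        "mozilla-foundation/common_voice_17_0",
        "sw",
        split=split,
        streaming=True,          # iterate over shards instead of downloading them
        trust_remote_code=True,  # assumed necessary for script-based datasets
    )
```

A validation subset could then be drawn with `take_samples(load_cv17_swahili_stream("validation"), 2000)`.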
Stage 1 Results:
- Final validation WER: 23.62%
- Training steps: 10,000
Stage 2: Common Voice 23.0 (Swahili)
Training Configuration:
- Starting point: Best checkpoint from Stage 1
- Training samples: Common Voice 23.0 Swahili training split (downloaded locally)
- Validation samples: 2,000 samples
- Source: Mozilla Common Voice 23.0
- Language: Swahili (sw)
Stage 2 Results:
- Best validation WER: 23.56% at step 7,500
- Training ran to 10,000 steps; the step 7,500 checkpoint was selected afterwards because it had the lowest validation WER
Baseline Performance:
- Base Whisper Large v2 (zero-shot): 89.05% WER on Common Voice 17.0 validation
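For readers unfamiliar with the metric, WER is the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below is a textbook implementation for a single utterance; the card does not state which library or text normalization produced the reported figures, so this is not necessarily how they were computed.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] is the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substituted word out of three reference words.
print(word_error_rate("habari ya asubuhi", "habari za asubuhi"))
```

Because insertions count as errors, WER can exceed 100%, which is worth keeping in mind when reading a zero-shot baseline as high as 89.05%.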
Training Procedure
Training Hyperparameters - Stage 1 (Common Voice 17.0)
Optimization:
- learning_rate: 1e-05
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08
- training_steps: 10,000
Batch Configuration:
- train_batch_size: 16
- eval_batch_size: 16
- gradient_accumulation_steps: 1
Memory Optimization:
- gradient_checkpointing: true
- mixed_precision_training: Native AMP (FP16)
- dataloader_num_workers: 2
Evaluation & Checkpointing:
- evaluation_strategy: steps (every 500 steps)
- save_steps: 500
- logging_steps: 50
Other:
- seed: 42
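The Stage 1 values above map onto keyword arguments of `transformers.Seq2SeqTrainingArguments`. The sketch below collects them in a plain dictionary; `output_dir` is a placeholder, and treating the batch size of 16 as a per-device value is an assumption, since the card does not state the number of devices.

```python
# Keyword arguments mirroring the Stage 1 values listed above.
stage1_args = {
    "output_dir": "./whisper-sw-stage1",  # hypothetical path
    "learning_rate": 1e-5,
    "lr_scheduler_type": "linear",
    "warmup_steps": 500,
    "max_steps": 10_000,
    "per_device_train_batch_size": 16,    # assumed to be per device
    "per_device_eval_batch_size": 16,
    "gradient_accumulation_steps": 1,
    "gradient_checkpointing": True,
    "fp16": True,
    "dataloader_num_workers": 2,
    "eval_strategy": "steps",
    "eval_steps": 500,
    "save_steps": 500,
    "logging_steps": 50,
    "seed": 42,
}

# Samples consumed per optimizer step on a single device.
effective_batch = (
    stage1_args["per_device_train_batch_size"]
    * stage1_args["gradient_accumulation_steps"]
)

# Number of evaluation passes over the run.
num_evals = stage1_args["max_steps"] // stage1_args["eval_steps"]

print(effective_batch, num_evals)
```

With Transformers installed, the dictionary can be unpacked as `Seq2SeqTrainingArguments(**stage1_args)`.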
Training Hyperparameters - Stage 2 (Common Voice 23.0)
Optimization:
- learning_rate: 5e-06 (reduced from Stage 1)
- lr_scheduler_type: cosine_with_restarts
- lr_scheduler_warmup_steps: 500
- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08
- training_steps: 20,000 (stopped at 10,000, best at 7,500)
Batch Configuration:
- train_batch_size: 16
- eval_batch_size: 16
Memory Optimization:
- mixed_precision_training: Native AMP (FP16)
Evaluation & Checkpointing:
- evaluation_strategy: steps (every 500 steps)
- save_steps: 500
- logging_steps: 50
Other:
- seed: 42
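To make the Stage 2 schedule concrete, the sketch below computes the learning rate at a given step for linear warmup followed by cosine decay with hard restarts, which is the shape the `cosine_with_restarts` scheduler type produces in Transformers. The single cycle (`num_cycles=1`) is an assumption, since the card does not state the cycle count, and the function name is illustrative.

```python
import math


def stage2_lr(step, peak_lr=5e-6, warmup_steps=500, total_steps=20_000, num_cycles=1):
    """Learning rate at `step`: linear warmup, then cosine with hard restarts."""
    if step < warmup_steps:
        # Linear ramp from 0 to peak_lr over the warmup period.
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    # Each cycle decays from peak_lr towards 0, then restarts at peak_lr.
    cycle_position = (num_cycles * progress) % 1.0
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * cycle_position))


for step in (250, 500, 7_500, 10_000):
    print(step, stage2_lr(step))
```

Under this assumption the run never reaches a restart: training stopped at 10,000 of the 20,000 configured steps, so the learning rate only covers roughly the first half of a single cosine decay.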
Training Results - Stage 2 (Common Voice 23.0)
| Training Loss | Epoch | Step | Validation Loss | WER |
|---|---|---|---|---|
| 0.0598 | 0.025 | 500 | 0.3869 | 24.8021 |
| 0.0488 | 0.05 | 1000 | 0.4222 | 26.9086 |
| 0.0481 | 0.075 | 1500 | 0.4376 | 26.9910 |
| 0.0804 | 0.1 | 2000 | 0.4429 | 29.2177 |
| 0.0433 | 0.125 | 2500 | 0.4569 | 26.8025 |
| 0.0939 | 1.0043 | 3000 | 0.4150 | 27.8982 |
| 0.0875 | 1.0293 | 3500 | 0.4083 | 24.3874 |
| 0.0237 | 1.0543 | 4000 | 0.4526 | 24.2578 |
| 0.0246 | 1.0793 | 4500 | 0.4664 | 24.9647 |
| 0.0267 | 1.1043 | 5000 | 0.4719 | 26.1074 |
| 0.0239 | 1.1293 | 5500 | 0.4533 | 24.6583 |
| 0.0327 | 2.0086 | 6000 | 0.4381 | 23.6923 |
| 0.0254 | 2.0336 | 6500 | 0.4369 | 23.7512 |
| 0.0155 | 2.0586 | 7000 | 0.4463 | 23.6216 |
| 0.0263 | 2.0836 | 7500 | 0.4469 | 23.5627 (best checkpoint) |
| 0.0249 | 2.1086 | 8000 | 0.4821 | 25.9189 |
| 0.0233 | 2.1336 | 8500 | 0.4914 | 27.0500 |
| 0.036 | 3.0129 | 9000 | 0.4738 | 24.1517 |
| 0.0485 | 3.0379 | 9500 | 0.4758 | 24.9647 |
| 0.0132 | 3.0629 | 10000 | 0.5175 | 25.5655 |
Training Observations:
- Initial performance on CV23: 24.80% WER (step 500)
- WER fluctuated between 23.6% and 29.2% rather than improving steadily, reaching its best value of 23.56% at step 7,500
- After step 7,500, WER rose to between 24.2% and 27.1% and validation loss climbed to 0.5175 by step 10,000, which suggests overfitting
- Model weights were restored to the step 7,500 checkpoint
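The retrospective checkpoint selection described above amounts to taking the minimum over the logged validation WERs. A minimal sketch using the values from the Stage 2 results table:

```python
# (step, validation WER in percent) pairs copied from the Stage 2 results table.
stage2_wer = [
    (500, 24.8021), (1000, 26.9086), (1500, 26.9910), (2000, 29.2177),
    (2500, 26.8025), (3000, 27.8982), (3500, 24.3874), (4000, 24.2578),
    (4500, 24.9647), (5000, 26.1074), (5500, 24.6583), (6000, 23.6923),
    (6500, 23.7512), (7000, 23.6216), (7500, 23.5627), (8000, 25.9189),
    (8500, 27.0500), (9000, 24.1517), (9500, 24.9647), (10000, 25.5655),
]

# Keep the checkpoint with the lowest validation WER.
best_step, best_wer = min(stage2_wer, key=lambda pair: pair[1])
worst_step, worst_wer = max(stage2_wer, key=lambda pair: pair[1])

print(f"best: step {best_step} ({best_wer:.2f}%)")
print(f"worst: step {worst_step} ({worst_wer:.2f}%)")
```

The Transformers `Trainer` can do this automatically with `load_best_model_at_end=True`, `metric_for_best_model="wer"`, and `greater_is_better=False`; the card does not say whether that mechanism was used here or the checkpoint was picked by hand.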
Combined Training Summary
Stage 1 (CV 17.0):
- Steps: 0 → 10,000
- Starting WER: 43.68% → Final WER: 23.62%
Stage 2 (CV 23.0):
- Steps: 0 → 7,500 (best checkpoint)
- Starting WER: 24.80% → Best WER: 23.56%
Total Effective Training: ~17,500 steps across two datasets
Performance Comparison
| Model | Dataset | Split | WER | Improvement from Baseline |
|---|---|---|---|---|
| Whisper Large v2 (baseline) | CV 17.0 | Validation | 89.05% | - |
| This model (Stage 1) | CV 17.0 | Validation | 23.62% | -65.43 pp (73.5% reduction) |
| This model (Stage 2 - Best) | CV 23.0 | Validation | 23.56% | -65.49 pp (73.5% reduction) |
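Note: The two-stage training approach with dataset progression (CV 17.0 → CV 23.0) achieved a marginal improvement in final WER (0.06 pp) while exposing the model to data from both Common Voice versions. The Stage 2 figure is measured on the CV 23.0 validation set, so its comparison with the CV 17.0 baseline is indicative rather than exact.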
Usage
from transformers import pipeline
# Load the model
pipe = pipeline("automatic-speech-recognition", model="openchs/asr-whisper-helpline-sw-v1")
# Transcribe audio
result = pipe("path/to/swahili_audio.wav")
print(result["text"])
Advanced Usage
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
# Load model and processor
processor = WhisperProcessor.from_pretrained("openchs/asr-whisper-helpline-sw-v1")
model = WhisperForConditionalGeneration.from_pretrained("openchs/asr-whisper-helpline-sw-v1")
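# Load and process audio
# ... your audio loading code ...
# `audio` must be a mono waveform (1-D float array) sampled at 16 kHz
# Generate transcription
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
with torch.no_grad():  # inference only, no gradients needed
    predicted_ids = model.generate(input_features)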
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
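The processor call above expects 16 kHz input, whereas narrowband telephony is commonly recorded at 8 kHz. That the helpline audio is 8 kHz is an assumption; the card does not state its sampling rate. The sketch below resamples a mono waveform by linear interpolation with NumPy. It is a rough stand-in to show the step, not a production resampler, and a dedicated audio library would normally be preferable.

```python
import numpy as np


def resample_linear(audio: np.ndarray, orig_sr: int, target_sr: int = 16_000) -> np.ndarray:
    """Resample a mono waveform to target_sr by linear interpolation."""
    if orig_sr == target_sr:
        return audio.astype(np.float32)
    n_target = int(round(len(audio) * target_sr / orig_sr))
    # Sample positions of both grids, expressed in seconds.
    t_orig = np.arange(len(audio)) / orig_sr
    t_target = np.arange(n_target) / target_sr
    return np.interp(t_target, t_orig, audio).astype(np.float32)


# One second of a 440 Hz tone at 8 kHz, upsampled to 16 kHz.
tone_8k = np.sin(2 * np.pi * 440 * np.arange(8_000) / 8_000)
tone_16k = resample_linear(tone_8k, orig_sr=8_000)
print(tone_16k.shape, tone_16k.dtype)
```

The result can be passed as `audio` to the processor. Upsampling does not restore the frequency content lost in a narrowband recording, so accuracy on such audio still needs to be validated.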
Future Work
- Domain Evaluation: Assessment on actual Tanzania Child Helpline call center audio to measure domain shift impact
- Domain Adaptation: Fine-tuning on telephony/call center audio for improved production performance
- Error Analysis: Detailed analysis of failure cases to identify improvement opportunities
- Test Set Evaluation: Comprehensive evaluation on Common Voice 23.0 test split
Citation
If you use this model, please cite:
@misc{openchs-swahili-asr-v1,
title={Swahili ASR Model for Tanzania Child Helpline},
author={OpenCHS Team},
year={2025},
publisher={HuggingFace},
howpublished={\url{https://huggingface.co/openchs/asr-whisper-helpline-sw-v1}}
}
Framework Versions
- Transformers: 4.56.2
- PyTorch: 2.8.0+cu128
- Datasets: 2.21.0
- Tokenizers: 0.22.1
License
Apache 2.0
Acknowledgments
- Base model: OpenAI Whisper Large v2
- Training data: Mozilla Common Voice 17.0 and Mozilla Common Voice 23.0
- Project: OpenCHS