asr-whisper-helpline-sw-v1

This model is a fine-tuned version of openai/whisper-large-v2 on the Common Voice Swahili dataset.

Model Description

This ASR model is specifically fine-tuned for Swahili speech recognition in the context of the Tanzania Child Helpline, powered by OpenCHS (Open Source Child Helpline System). The model is designed to transcribe Swahili spoken in Tanzanian call center environments.

Performance Highlights:

  • Best Validation WER: 23.56% (achieved at step 7,500 of continued training)
  • Baseline WER: 89.05% (Whisper Large v2 zero-shot on Common Voice 17.0)
  • Improvement: 65.49 percentage point reduction in WER (73.5% relative error rate reduction)

This represents a significant improvement over the base Whisper Large v2 model for Swahili transcription tasks.
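
The two improvement figures measure different things: the percentage point drop is an absolute difference, while the error rate reduction is relative to the baseline. A quick check of the arithmetic, using only the WER values quoted above (the function name is illustrative, not part of any library):

```python
def wer_improvement(baseline_wer, model_wer):
    """Return (absolute drop in percentage points, relative reduction in percent)."""
    absolute_pp = baseline_wer - model_wer
    relative_pct = 100.0 * absolute_pp / baseline_wer
    return absolute_pp, relative_pct


# WER values taken from the Performance Highlights list.
absolute_pp, relative_pct = wer_improvement(89.05, 23.56)
print(f"{absolute_pp:.2f} pp absolute, {relative_pct:.1f}% relative")
# prints: 65.49 pp absolute, 73.5% relative
```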

Training Strategy

The model was trained in two stages:

  1. Stage 1 - Common Voice 17.0: Initial fine-tuning on Common Voice 17.0 Swahili dataset (10,000 steps)
  2. Stage 2 - Common Voice 23.0: Continued fine-tuning on Common Voice 23.0 Swahili dataset (7,500 steps)

Total Training: 17,500 effective steps with the best checkpoint selected at step 7,500 of stage 2 based on lowest validation WER.

Intended Uses & Limitations

Intended Uses

  • Primary: Transcribing Swahili speech in call center environments, specifically for child helpline services in Tanzania
  • General: Swahili automatic speech recognition tasks
  • Research: Baseline for domain adaptation studies (general speech → telephony/call center audio)

Limitations

  • Domain Shift: Model is trained on Common Voice (clean, read speech) but intended for call center audio. Performance on actual telephony audio may differ and requires validation.
  • Language Variety: Training data may not fully represent all Tanzanian Swahili dialects and speaking styles.
  • Audio Quality: Performance may degrade with low-quality audio, background noise, or poor recording conditions typical in telephony.
  • Code-Switching: May not handle code-switching between Swahili and English/other languages well.

Known Issues

  • Domain-specific evaluation on actual call center audio is pending

Training and Evaluation Data

Stage 1: Common Voice 17.0 (Swahili)

Training Configuration:

  • Training samples: Streamed entire Common Voice 17.0 Swahili training split
  • Validation samples: 2,000 samples
  • Source: Mozilla Common Voice 17.0
  • Language: Swahili (sw)
  • Data type: Read speech from diverse speakers
  • Streaming mode: Used dataset streaming to minimize disk usage
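
A minimal sketch of what streaming a training split and capping validation at 2,000 samples could look like with the Hugging Face `datasets` library. The helper names are hypothetical, and the dataset identifier in the usage comment is an assumption; the card does not state the exact loading code or hub ID that was used.

```python
from itertools import islice


def stream_split(dataset_id, config, split):
    """Open a split lazily so samples are fetched on demand, not saved to disk."""
    # Imported here so the helper below stays usable without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(dataset_id, config, split=split, streaming=True)


def take_first(stream, n):
    """Materialise only the first n samples of any iterable stream."""
    return list(islice(stream, n))


# Hypothetical usage (dataset ID is an assumption, not confirmed by this card):
# train_stream = stream_split("mozilla-foundation/common_voice_17_0", "sw", "train")
# valid_samples = take_first(
#     stream_split("mozilla-foundation/common_voice_17_0", "sw", "validation"), 2000
# )
```

Because a streamed split has no known length up front, training on it is usually bounded by a step count rather than an epoch count, which fits the step-based configuration reported below.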

Stage 1 Results:

  • Final validation WER: 23.62%
  • Training steps: 10,000

Stage 2: Common Voice 23.0 (Swahili)

Training Configuration:

  • Starting point: Best checkpoint from Stage 1
  • Training samples: Common Voice 23.0 Swahili training split (downloaded locally)
  • Validation samples: 2,000 samples
  • Source: Mozilla Common Voice 23.0
  • Language: Swahili (sw)

Stage 2 Results:

  • Best validation WER: 23.56% at step 7,500
  • Training ran to 10,000 steps; the step 7,500 checkpoint was then selected retrospectively as the best (lowest validation WER)

Baseline Performance:

  • Base Whisper Large v2 (zero-shot): 89.05% WER on Common Voice 17.0 validation
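
For readers unfamiliar with the metric, WER is the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below is a plain-Python illustration of that definition. This card does not state which WER implementation or text normalization produced the reported figures, so results from this function are not guaranteed to match them. The example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substituted word out of three reference words.
print(round(word_error_rate("habari ya asubuhi", "habari za asubuhi"), 2))
```

Corpus-level WER is normally computed by summing errors and reference words over all utterances before dividing, not by averaging per-utterance scores.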

Training Procedure

Training Hyperparameters - Stage 1 (Common Voice 17.0)

Optimization:

  • learning_rate: 1e-05
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 500
  • optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08
  • training_steps: 10,000

Batch Configuration:

  • train_batch_size: 16
  • eval_batch_size: 16
  • gradient_accumulation_steps: 1

Memory Optimization:

  • gradient_checkpointing: true
  • mixed_precision_training: Native AMP (FP16)
  • dataloader_num_workers: 2

Evaluation & Checkpointing:

  • evaluation_strategy: steps (every 500 steps)
  • save_steps: 500
  • logging_steps: 50

Other:

  • seed: 42
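
As a rough guide for reproduction, the Stage 1 values above can be collected as keyword arguments for `Seq2SeqTrainingArguments` in `transformers`. This is a sketch, not the original training script: `output_dir` is a hypothetical path, and mapping `train_batch_size` to `per_device_train_batch_size` assumes single-GPU training, which the card does not confirm.

```python
# Stage 1 hyperparameters expressed as Seq2SeqTrainingArguments keyword arguments.
stage1_kwargs = dict(
    output_dir="whisper-sw-stage1",  # hypothetical path, not from the card
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=500,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    max_steps=10_000,
    per_device_train_batch_size=16,  # assumes a single GPU
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,
    fp16=True,                       # native AMP; requires a CUDA device
    dataloader_num_workers=2,
    eval_strategy="steps",           # named `evaluation_strategy` in older releases
    eval_steps=500,
    save_steps=500,
    logging_steps=50,
    seed=42,
)

# On a machine with a GPU and transformers installed:
# from transformers import Seq2SeqTrainingArguments
# training_args = Seq2SeqTrainingArguments(**stage1_kwargs)
```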

Training Hyperparameters - Stage 2 (Common Voice 23.0)

Optimization:

  • learning_rate: 5e-06 (reduced from Stage 1)
  • lr_scheduler_type: cosine_with_restarts
  • lr_scheduler_warmup_steps: 500
  • optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08
  • training_steps: 20,000 (stopped at 10,000, best at 7,500)

Batch Configuration:

  • train_batch_size: 16
  • eval_batch_size: 16

Memory Optimization:

  • mixed_precision_training: Native AMP (FP16)

Evaluation & Checkpointing:

  • evaluation_strategy: steps (every 500 steps)
  • save_steps: 500
  • logging_steps: 50

Other:

  • seed: 42
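
Since Stage 2 was scheduled for 20,000 steps but stopped at 10,000, the learning rate never finished decaying. The sketch below reproduces the usual shape of a warmup plus cosine-with-restarts schedule in plain Python. It assumes a single cosine cycle, because the card does not state the number of restarts; the function name is illustrative.

```python
import math


def cosine_with_restarts_lr(step, base_lr=5e-6, warmup_steps=500,
                            total_steps=20_000, num_cycles=1):
    """Learning rate at a given step: linear warmup, then cosine decay with hard restarts."""
    if step < warmup_steps:
        # Linear warmup from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    # Each cycle decays from base_lr towards 0, then restarts (num_cycles=1: no restart).
    cycle_position = (num_cycles * progress) % 1.0
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * cycle_position)))


for step in (500, 7_500, 10_000):
    print(step, cosine_with_restarts_lr(step))
```

Under these assumptions the learning rate at step 10,000 is still roughly half of its peak, so the later checkpoints were trained with a comparatively large step size.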

Training Results - Stage 2 (Common Voice 23.0)

Training Loss   Epoch    Step    Validation Loss   WER (%)
0.0598          0.025    500     0.3869            24.8021
0.0488          0.05     1000    0.4222            26.9086
0.0481          0.075    1500    0.4376            26.9910
0.0804          0.1      2000    0.4429            29.2177
0.0433          0.125    2500    0.4569            26.8025
0.0939          1.0043   3000    0.4150            27.8982
0.0875          1.0293   3500    0.4083            24.3874
0.0237          1.0543   4000    0.4526            24.2578
0.0246          1.0793   4500    0.4664            24.9647
0.0267          1.1043   5000    0.4719            26.1074
0.0239          1.1293   5500    0.4533            24.6583
0.0327          2.0086   6000    0.4381            23.6923
0.0254          2.0336   6500    0.4369            23.7512
0.0155          2.0586   7000    0.4463            23.6216
0.0263          2.0836   7500    0.4469            23.5627   ← Best checkpoint
0.0249          2.1086   8000    0.4821            25.9189
0.0233          2.1336   8500    0.4914            27.0500
0.036           3.0129   9000    0.4738            24.1517
0.0485          3.0379   9500    0.4758            24.9647
0.0132          3.0629   10000   0.5175            25.5655

Training Observations:

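  • Initial performance on CV23: 24.80% WER (step 500)
  • WER fluctuated rather than improving steadily, peaking at 29.22% (step 2,000) before reaching its best value of 23.56% at step 7,500
  • After step 7,500, WER rose to 25.92% at step 8,000 and stayed above 24% through step 10,000, while validation loss ended at its highest value (0.5175, up from 0.3869 at step 500); both are signs of overfitting
  • Model weights were restored to the step 7,500 checkpoint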
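
Selecting the best checkpoint amounts to taking the evaluation step with the lowest validation WER. The snippet below does this over the (step, WER) pairs from the Stage 2 results table. It illustrates the selection rule only; the card says selection was applied retrospectively and does not describe the tooling used.

```python
# (step, validation WER in percent) pairs copied from the Stage 2 results table.
eval_log = [
    (500, 24.8021), (1000, 26.9086), (1500, 26.9910), (2000, 29.2177),
    (2500, 26.8025), (3000, 27.8982), (3500, 24.3874), (4000, 24.2578),
    (4500, 24.9647), (5000, 26.1074), (5500, 24.6583), (6000, 23.6923),
    (6500, 23.7512), (7000, 23.6216), (7500, 23.5627), (8000, 25.9189),
    (8500, 27.0500), (9000, 24.1517), (9500, 24.9647), (10000, 25.5655),
]

# Lower WER is better, so the best checkpoint is the minimum over the WER column.
best_step, best_wer = min(eval_log, key=lambda row: row[1])
print(best_step, best_wer)  # prints: 7500 23.5627
```

With the Hugging Face `Trainer`, the same rule is commonly automated through `load_best_model_at_end=True`, `metric_for_best_model="wer"`, and `greater_is_better=False`.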

Combined Training Summary

Stage 1 (CV 17.0):

  • Steps: 0 → 10,000
  • Starting WER: 43.68% → Final WER: 23.62%

Stage 2 (CV 23.0):

  • Steps: 0 → 7,500 (best checkpoint)
  • Starting WER: 24.80% → Best WER: 23.56%

Total Effective Training: ~17,500 steps across two datasets

Performance Comparison

Model                         Dataset   Split        WER      Improvement from Baseline
Whisper Large v2 (baseline)   CV 17.0   Validation   89.05%   -
This model (Stage 1)          CV 17.0   Validation   23.62%   -65.43 pp (73.5% reduction)
This model (Stage 2 - Best)   CV 23.0   Validation   23.56%   -65.49 pp (73.5% reduction)

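Note: The two-stage training approach with dataset progression (CV 17.0 → CV 23.0) changed the final validation WER only marginally (23.62% → 23.56%). The Stage 1 and Stage 2 figures come from different validation sets and the baseline was measured only on CV 17.0, so the comparison is indicative rather than exact; the similar WER on both sets suggests the model holds up across Common Voice versions.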

Usage

from transformers import pipeline

# Load the model
pipe = pipeline("automatic-speech-recognition",
                model="openchs/asr-whisper-helpline-sw-v1")

# Transcribe audio
result = pipe("path/to/swahili_audio.wav")
print(result["text"])

Advanced Usage

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa
import torch

# Load model and processor
processor = WhisperProcessor.from_pretrained("openchs/asr-whisper-helpline-sw-v1")
model = WhisperForConditionalGeneration.from_pretrained("openchs/asr-whisper-helpline-sw-v1")
model.eval()

# Load audio as a 16 kHz mono waveform (the sampling rate Whisper expects).
# librosa is one option; any loader that yields a 16 kHz float array works.
audio, _ = librosa.load("path/to/swahili_audio.wav", sr=16000)

# Convert the waveform to log-mel input features
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

# Generate transcription without tracking gradients
with torch.no_grad():
    predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])

Future Work

  • Domain Evaluation: Assessment on actual Tanzania Child Helpline call center audio to measure domain shift impact
  • Domain Adaptation: Fine-tuning on telephony/call center audio for improved production performance
  • Error Analysis: Detailed analysis of failure cases to identify improvement opportunities
  • Test Set Evaluation: Comprehensive evaluation on Common Voice 23.0 test split

Citation

If you use this model, please cite:

@misc{openchs-swahili-asr-v1,
  title={Swahili ASR Model for Tanzania Child Helpline},
  author={OpenCHS Team},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/openchs/asr-whisper-helpline-sw-v1}}
}

Framework Versions

  • Transformers: 4.56.2
  • PyTorch: 2.8.0+cu128
  • Datasets: 2.21.0
  • Tokenizers: 0.22.1

License

Apache 2.0

Acknowledgments
