asr-whisper-helpline-sw-v1

This model is a fine-tuned version of openai/whisper-large-v2, trained on the Mozilla Common Voice Swahili datasets (versions 17.0 and 23.0).

Model Description

This ASR model is specifically fine-tuned for Swahili speech recognition in the context of the Tanzania Child Helpline, powered by OpenCHS (Open Source Child Helpline System). The model is designed to transcribe Swahili spoken in Tanzanian call center environments.

Performance Highlights:

Best Validation WER: 23.56% (achieved at step 7,500 of continued training)
Baseline WER: 89.05% (Whisper Large v2 zero-shot on Common Voice 17.0)
Improvement: ~65.5 percentage point reduction in WER (~73.5% relative error rate reduction)

This represents a significant improvement over the base Whisper Large v2 model for Swahili transcription tasks.
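The headline figures follow directly from the two WER values reported above. A quick check of the arithmetic, using only numbers from this card:

```python
baseline_wer = 89.05  # Whisper Large v2 zero-shot, Common Voice 17.0 validation
best_wer = 23.56      # this model, best Stage 2 checkpoint

# Absolute reduction is measured in percentage points (pp)
absolute_reduction = baseline_wer - best_wer
# Relative reduction is the share of baseline errors that were removed
relative_reduction = absolute_reduction / baseline_wer * 100

print(f"Absolute reduction: {absolute_reduction:.2f} pp")
print(f"Relative reduction: {relative_reduction:.1f}%")
```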

Training Strategy

The model was trained in two stages:

Stage 1 - Common Voice 17.0: Initial fine-tuning on Common Voice 17.0 Swahili dataset (10,000 steps)
Stage 2 - Common Voice 23.0: Continued fine-tuning on Common Voice 23.0 Swahili dataset (7,500 steps)

Total Training: 17,500 effective steps with the best checkpoint selected at step 7,500 of stage 2 based on lowest validation WER.

Intended Uses & Limitations

Intended Uses

Primary: Transcribing Swahili speech in call center environments, specifically for child helpline services in Tanzania
General: Swahili automatic speech recognition tasks
Research: Baseline for domain adaptation studies (general speech → telephony/call center audio)

Limitations

Domain Shift: Model is trained on Common Voice (clean, read speech) but intended for call center audio. Performance on actual telephony audio may differ and requires validation.
Language Variety: Training data may not fully represent all Tanzanian Swahili dialects and speaking styles.
Audio Quality: Performance may degrade with low-quality audio, background noise, or poor recording conditions typical in telephony.
Code-Switching: May not handle code-switching between Swahili and English/other languages well.

Known Issues

Domain-specific evaluation on actual call center audio is pending

Training and Evaluation Data

Stage 1: Common Voice 17.0 (Swahili)

Training Configuration:

Training samples: Streamed entire Common Voice 17.0 Swahili training split
Validation samples: 2,000 samples
Source: Mozilla Common Voice 17.0
Language: Swahili (sw)
Data type: Read speech from diverse speakers
Streaming mode: Used dataset streaming to minimize disk usage

Stage 1 Results:

Final validation WER: 23.62%
Training steps: 10,000

Stage 2: Common Voice 23.0 (Swahili)

Training Configuration:

Starting point: Best checkpoint from Stage 1
Training samples: Common Voice 23.0 Swahili training split (downloaded locally)
Validation samples: 2,000 samples
Source: Mozilla Common Voice 23.0
Language: Swahili (sw)

Stage 2 Results:

Best validation WER: 23.56% at step 7,500
Training ran to 10,000 steps; the step 7,500 checkpoint was then selected retrospectively because it had the lowest validation WER

Baseline Performance:

Base Whisper Large v2 (zero-shot): 89.05% WER on Common Voice 17.0 validation
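WER (word error rate) is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal pure-Python illustration of that definition, not the evaluation code used for this model. The function name `word_error_rate` and the example sentences are hypothetical, and real evaluations usually apply text normalization first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of four reference words
wer = word_error_rate("habari ya asubuhi rafiki", "habari za asubuhi rafiki")
print(f"WER: {wer:.2%}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.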

Training Procedure

Training Hyperparameters - Stage 1 (Common Voice 17.0)

Optimization:

learning_rate: 1e-05
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 500
optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08
training_steps: 10,000

Batch Configuration:

train_batch_size: 16
eval_batch_size: 16
gradient_accumulation_steps: 1

Memory Optimization:

gradient_checkpointing: true
mixed_precision_training: Native AMP (FP16)
dataloader_num_workers: 2

Evaluation & Checkpointing:

evaluation_strategy: steps (every 500 steps)
save_steps: 500
logging_steps: 50

Other:

seed: 42
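The Stage 1 values above map onto the keyword arguments of Hugging Face `Seq2SeqTrainingArguments`. The sketch below collects them in a plain dictionary. The training script is not part of this card, so the key names (for example `max_steps`, `warmup_steps`, `fp16`) and the `output_dir` path are assumptions about how the run was configured.

```python
# Stage 1 hyperparameters from this card, expressed as keyword arguments.
# Key names follow Hugging Face TrainingArguments conventions (an assumption;
# the original training script is not shown here).
stage1_args = {
    "output_dir": "./whisper-sw-stage1",  # hypothetical path
    "learning_rate": 1e-5,
    "lr_scheduler_type": "linear",
    "warmup_steps": 500,
    "max_steps": 10_000,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "gradient_accumulation_steps": 1,
    "gradient_checkpointing": True,
    "fp16": True,
    "dataloader_num_workers": 2,
    "eval_steps": 500,
    "save_steps": 500,
    "logging_steps": 50,
    "seed": 42,
}

# Samples per optimizer step on one device (the device count is not stated in this card)
effective_batch = (
    stage1_args["per_device_train_batch_size"]
    * stage1_args["gradient_accumulation_steps"]
)
print(effective_batch)
```

With `transformers` installed, the dictionary could be passed as `Seq2SeqTrainingArguments(**stage1_args)` along with a step-based evaluation strategy. That argument is named `evaluation_strategy` in older releases and `eval_strategy` in newer ones.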

Training Hyperparameters - Stage 2 (Common Voice 23.0)

Optimization:

learning_rate: 5e-06 (reduced from Stage 1)
lr_scheduler_type: cosine_with_restarts
lr_scheduler_warmup_steps: 500
optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08
training_steps: 20,000 configured (training stopped at 10,000; best checkpoint at 7,500)

Batch Configuration:

train_batch_size: 16
eval_batch_size: 16

Memory Optimization:

mixed_precision_training: Native AMP (FP16)

Evaluation & Checkpointing:

evaluation_strategy: steps (every 500 steps)
save_steps: 500
logging_steps: 50

Other:

seed: 42
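Stage 2 was configured for 20,000 steps but stopped at 10,000, so the learning rate schedule never completed its decay. The sketch below illustrates a cosine schedule with hard restarts and linear warmup. It assumes a single cycle (`num_cycles=1`), which this card does not state. The function name `cosine_restarts_lr` is hypothetical, and the formula is a common formulation rather than a confirmed copy of the training setup.

```python
import math


def cosine_restarts_lr(step, base_lr=5e-6, warmup_steps=500,
                       total_steps=20_000, num_cycles=1):
    """Learning rate at `step`: linear warmup, then cosine decay with hard restarts."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    # Each cycle decays from base_lr towards 0, then restarts at base_lr.
    cycle_position = (num_cycles * progress) % 1.0
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * cycle_position))


for step in (0, 500, 7_500, 10_000):
    print(step, f"{cosine_restarts_lr(step):.2e}")
```

Under these assumptions the learning rate at step 7,500 is still a substantial fraction of its peak, since the schedule is stretched over the full 20,000 configured steps.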

Training Results - Stage 2 (Common Voice 23.0)

Training Loss	Epoch	Step	Validation Loss	WER
0.0598	0.025	500	0.3869	24.8021
0.0488	0.05	1000	0.4222	26.9086
0.0481	0.075	1500	0.4376	26.9910
0.0804	0.1	2000	0.4429	29.2177
0.0433	0.125	2500	0.4569	26.8025
0.0939	1.0043	3000	0.4150	27.8982
0.0875	1.0293	3500	0.4083	24.3874
0.0237	1.0543	4000	0.4526	24.2578
0.0246	1.0793	4500	0.4664	24.9647
0.0267	1.1043	5000	0.4719	26.1074
0.0239	1.1293	5500	0.4533	24.6583
0.0327	2.0086	6000	0.4381	23.6923
0.0254	2.0336	6500	0.4369	23.7512
0.0155	2.0586	7000	0.4463	23.6216
0.0263	2.0836	7500	0.4469	23.5627 ← Best checkpoint
0.0249	2.1086	8000	0.4821	25.9189
0.0233	2.1336	8500	0.4914	27.0500
0.036	3.0129	9000	0.4738	24.1517
0.0485	3.0379	9500	0.4758	24.9647
0.0132	3.0629	10000	0.5175	25.5655

Training Observations:

Initial performance on CV23: 24.80% WER (step 500)
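WER fluctuated early in training (peaking at 29.22% at step 2,000) before reaching its best value of 23.56% at step 7,500
After step 7,500, WER rose (25.92% at step 8,000, 27.05% at step 8,500) and validation loss climbed to 0.5175 by step 10,000, which suggests overfitting
Model weights were restored from the step 7,500 checkpoint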
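Retrospective checkpoint selection amounts to taking the minimum over the logged validation WER values. A minimal sketch using the Stage 2 results table above follows; the variable names are illustrative. With the Hugging Face Trainer, `load_best_model_at_end=True` together with `metric_for_best_model="wer"` is a common way to automate this, though this card does not say whether it was used.

```python
# (step, validation WER %) pairs copied from the Stage 2 results table
eval_log = [
    (500, 24.8021), (1000, 26.9086), (1500, 26.9910), (2000, 29.2177),
    (2500, 26.8025), (3000, 27.8982), (3500, 24.3874), (4000, 24.2578),
    (4500, 24.9647), (5000, 26.1074), (5500, 24.6583), (6000, 23.6923),
    (6500, 23.7512), (7000, 23.6216), (7500, 23.5627), (8000, 25.9189),
    (8500, 27.0500), (9000, 24.1517), (9500, 24.9647), (10000, 25.5655),
]

# Pick the checkpoint with the lowest validation WER
best_step, best_wer = min(eval_log, key=lambda row: row[1])
print(f"Best checkpoint: step {best_step} (WER {best_wer:.2f}%)")
```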

Combined Training Summary

Stage 1 (CV 17.0):

Steps: 0 → 10,000
Starting WER: 43.68% → Final WER: 23.62%

Stage 2 (CV 23.0):

Steps: 0 → 7,500 (best checkpoint)
Starting WER: 24.80% → Best WER: 23.56%

Total Effective Training: ~17,500 steps across two datasets

Performance Comparison

Model	Dataset	Split	WER	Improvement from Baseline
Whisper Large v2 (baseline)	CV 17.0	Validation	89.05%	-
This model (Stage 1)	CV 17.0	Validation	23.62%	-65.43 pp (73.5% reduction)
This model (Stage 2 - Best)	CV 23.0	Validation	23.56%	-65.49 pp (73.5% reduction)

Note: Continuing training from CV 17.0 to CV 23.0 yielded only a marginal WER improvement (23.62% → 23.56%). The baseline was measured on CV 17.0 validation, so the Stage 2 row compares results across two different Common Voice versions rather than on the same data.

Usage

from transformers import pipeline

# Load the model
pipe = pipeline("automatic-speech-recognition",
                model="openchs/asr-whisper-helpline-sw-v1")

# Transcribe audio
result = pipe("path/to/swahili_audio.wav")
print(result["text"])

Advanced Usage

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa  # assumption: any loader that returns a 16 kHz mono float waveform works
import torch

# Load model and processor
processor = WhisperProcessor.from_pretrained("openchs/asr-whisper-helpline-sw-v1")
model = WhisperForConditionalGeneration.from_pretrained("openchs/asr-whisper-helpline-sw-v1")

# Load audio and resample to the 16 kHz rate Whisper expects
audio, _ = librosa.load("path/to/swahili_audio.wav", sr=16000)

# Generate transcription (no gradients are needed for inference)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
with torch.no_grad():
    predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])

Future Work

Domain Evaluation: Assessment on actual Tanzania Child Helpline call center audio to measure domain shift impact
Domain Adaptation: Fine-tuning on telephony/call center audio for improved production performance
Error Analysis: Detailed analysis of failure cases to identify improvement opportunities
Test Set Evaluation: Comprehensive evaluation on Common Voice 23.0 test split

Citation

If you use this model, please cite:

@misc{openchs-swahili-asr-v1,
  title={Swahili ASR Model for Tanzania Child Helpline},
  author={OpenCHS Team},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/openchs/asr-whisper-helpline-sw-v1}}
}