---
language:
- sw
- en
license: apache-2.0
tags:
- translation
- swahili
- english
- opus-mt
- openchs
datasets:
- nllb
- ccaligned
metrics:
- bleu
- chrf
- comet
---
# Swahili-English Translation Model for Child Helpline Services

## Model Description

This model is a fine-tuned version of `Helsinki-NLP/opus-mt-mul-en` for Swahili-to-English translation, optimized for transcriptions of child helpline calls in East Africa (Tanzania, Uganda, Kenya).

**Developed by:** BITZ IT Consulting Ltd

**Project:** OpenCHS (Open Child Helpline System)

**Funded by:** UNICEF Venture Fund

**License:** Apache 2.0

## Training Data

The model was fine-tuned on a combination of:

- NLLB Swahili-English parallel corpus (high quality, weight 1.0)
- CCAligned web-crawled parallel data (supplementary, weight 0.7)
- Total training samples: roughly 50,000 sentence pairs
- Domain focus: conversational Swahili from helpline contexts
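
The corpus weighting above can be read as a sampling scheme. The `mix_corpora` helper below is purely illustrative (the actual training pipeline is not published): it keeps every NLLB pair at weight 1.0 and keeps each CCAligned pair with probability 0.7.

```python
import random

def mix_corpora(nllb_pairs, ccaligned_pairs, ccaligned_weight=0.7, seed=42):
    """Illustrative corpus mixing: keep all NLLB pairs (weight 1.0),
    keep each CCAligned pair with probability `ccaligned_weight`,
    then shuffle. Deterministic for a fixed seed."""
    rng = random.Random(seed)
    mixed = list(nllb_pairs)
    mixed += [pair for pair in ccaligned_pairs if rng.random() < ccaligned_weight]
    rng.shuffle(mixed)
    return mixed
```

Down-sampling rather than loss-weighting is one common way to realize such weights; either reading is consistent with the numbers on this card.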
## Performance

### Test Set (General Translation)

- **BLEU:** 0.1277
- **chrF:** 32.30
- **Improvement over baseline:** +0.0%

### Domain Evaluation (Call Transcriptions)

- **Domain BLEU:** 0.0000
- **Domain chrF:** 1.86
- **Domain COMET-QE:** 0.0000

*Domain metrics are evaluated on real 10-minute call transcriptions from child helplines.*
## Intended Use

**Primary Use Case:** Translating Swahili helpline call transcriptions to English for:

- Case documentation and reporting
- Quality assurance and supervision
- Cross-border case referrals
- Data analysis and insights

**Languages:** Swahili (source) → English (target)

**Limitations:**

- Optimized for conversational Swahili (East African dialects)
- May not perform well on formal or literary Swahili
- Best for text lengths under 512 tokens
- Requires post-editing for critical use cases
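
For inputs longer than the 512-token limit, one common workaround is to translate sentence-sized chunks independently and concatenate the results. The sketch below is a naive illustration: it splits on `.` rather than using a proper sentence segmenter, and `tokenizer` stands for any object with a `tokenize` method (such as the MarianTokenizer from the Usage section).

```python
def split_into_chunks(text, tokenizer, max_tokens=512):
    """Greedily pack whole sentences into chunks that stay under the
    model's token limit. Splitting on '.' is a naive placeholder for
    a real sentence segmenter."""
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks, current = [], ""
    for sent in sentences:
        candidate = (current + " " + sent).strip()
        if len(tokenizer.tokenize(candidate)) > max_tokens and current:
            # Adding this sentence would overflow: flush and start anew.
            chunks.append(current)
            current = sent
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed through `model.generate` separately; note that cross-sentence context is lost at chunk boundaries.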
## Usage

```python
from transformers import MarianTokenizer, MarianMTModel

model_name = "brendaogutu/sw-en-translation-v1"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Translate a Swahili sentence to English
swahili_text = "Habari za asubuhi. Ninaitwa Amina na nina miaka 14."
inputs = tokenizer(swahili_text, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, num_beams=5, max_length=256)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
```
## Training Details

**Base Model:** Helsinki-NLP/opus-mt-mul-en

**Training Epochs:** 8

**Batch Size:** 550 (effective batch size 550, i.e. no gradient accumulation)

**Learning Rate:** 3e-05

**Optimizer:** AdamW

**Hardware:** NVIDIA GPU with FP16 mixed precision
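
The hyperparameters above map onto a `Seq2SeqTrainingArguments` configuration roughly like the following. This is a reconstruction from the card, not the project's published training script; the output path is a placeholder.

```python
from transformers import Seq2SeqTrainingArguments

# Reconstruction of the card's hyperparameters; output_dir is a placeholder.
training_args = Seq2SeqTrainingArguments(
    output_dir="./sw-en-finetune",     # placeholder path
    num_train_epochs=8,                # "Training Epochs: 8"
    per_device_train_batch_size=550,   # "Batch Size: 550"
    gradient_accumulation_steps=1,     # effective batch size equals batch size
    learning_rate=3e-5,                # "Learning Rate: 3e-05"
    optim="adamw_torch",               # "Optimizer: AdamW"
    fp16=True,                         # FP16 mixed precision
)
```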
## Evaluation Methodology

1. **Test Set:** Random 5% split from the training distribution
2. **Domain Evaluation:** Held-out set of real helpline call transcriptions (10-minute calls)
3. **Metrics:** BLEU, chrF, COMET-QE (quality estimation)
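
In practice these scores are computed with standard tooling such as `sacrebleu`. Purely to illustrate what a character n-gram F-score measures, here is a simplified single-order chrF-style metric; the real chrF averages over n = 1..6 (and chrF++ adds word n-grams), so this sketch is not a drop-in replacement.

```python
from collections import Counter

def char_ngram_f1(hyp, ref, n=2, beta=2.0):
    """Simplified chrF-style score: F-measure over character n-grams
    of a single order n, with recall weighted by beta (chrF uses 2)."""
    def ngrams(s):
        s = s.replace(" ", "")  # chrF ignores spaces by default
        return Counter(s[i:i + n] for i in range(len(s) - n + 1))
    h, r = ngrams(hyp), ngrams(ref)
    if not h or not r:
        return 0.0
    overlap = sum((h & r).values())  # clipped n-gram matches
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    if precision + recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
```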
## Citation

```bibtex
@software{openchs_translation_2025,
  author    = {{BITZ IT Consulting Ltd}},
  title     = {Swahili-English Translation Model for OpenCHS},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/brendaogutu/sw-en-translation-v1}
}
```
## Contact

For questions or issues, please contact: brenda@openchs.org
---

*This model is part of the OpenCHS project supporting child helpline services across East Africa.*
|
|
|