---
language:
- sw
- en
license: apache-2.0
tags:
- translation
- swahili
- english
- opus-mt
- openchs
datasets:
- nllb
- ccaligned
metrics:
- bleu
- chrf
- comet
---
# Swahili-English Translation Model for Child Helpline Services
## Model Description
This model is a fine-tuned version of `Helsinki-NLP/opus-mt-mul-en` for Swahili-to-English translation, specifically optimized for child helpline call transcriptions in East Africa (Tanzania, Uganda, Kenya).
**Developed by:** BITZ IT Consulting Ltd
**Project:** OpenCHS (Open Child Helpline System)
**Funded by:** UNICEF Venture Fund
**License:** Apache 2.0
## Training Data
The model was fine-tuned on a combination of:
- NLLB Swahili-English parallel corpus (high quality, weight: 1.0)
- CCAligned web-crawled parallel data (supplementary, weight: 0.7)
- Total training samples: approximately 50,000 sentence pairs
- Domain focus: Conversational Swahili from helpline contexts
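The source weights above (1.0 for NLLB, 0.7 for CCAligned) can be read as sampling probabilities when mixing the two corpora. The sketch below is a minimal illustration using only the Python standard library; the tiny in-line sentence pairs are hypothetical stand-ins for the real corpora, and the actual training pipeline may have applied the weights differently (e.g. as per-example loss weights).

```python
import random

# Hypothetical stand-ins for the real parallel corpora.
nllb_pairs = [("Habari", "Hello"), ("Asante sana", "Thank you very much")]
ccaligned_pairs = [("Karibu", "Welcome"), ("Kwaheri", "Goodbye")]

# Source-level weights from the model card: NLLB 1.0, CCAligned 0.7.
sources = [(nllb_pairs, 1.0), (ccaligned_pairs, 0.7)]

def sample_mixed(n, seed=0):
    """Draw n training pairs, choosing a source in proportion to its weight."""
    rng = random.Random(seed)
    corpora = [corpus for corpus, _ in sources]
    weights = [weight for _, weight in sources]
    batch = []
    for _ in range(n):
        corpus = rng.choices(corpora, weights=weights, k=1)[0]
        batch.append(rng.choice(corpus))
    return batch

batch = sample_mixed(1000)
```

With weights 1.0 and 0.7, roughly 59% of sampled pairs (1.0 / 1.7) come from NLLB.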
## Performance
### Test Set (General Translation)
- **BLEU:** 0.1277
- **chrF:** 32.30
- **Improvement over baseline:** +0.0%
### Domain Evaluation (Call Transcriptions)
- **Domain BLEU:** 0.0000
- **Domain chrF:** 1.86
- **Domain COMET-QE:** 0.0000
*Domain metrics are evaluated on real 10-minute call transcriptions from child helplines.*
## Intended Use
**Primary Use Case:** Translating Swahili helpline call transcriptions to English for:
- Case documentation and reporting
- Quality assurance and supervision
- Cross-border case referrals
- Data analysis and insights
**Languages:** Swahili (source) → English (target)
**Limitations:**
- Optimized for conversational Swahili (East African dialects)
- May not perform well on formal/literary Swahili
- Best for text lengths under 512 tokens
- Requires post-editing for critical use cases
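Because quality degrades beyond the model's length limit, long call transcriptions are best split into chunks before translation. The sketch below is one simple way to do that, using whitespace tokens as a rough proxy for subword tokens (the real budget should be checked against the tokenizer, so the 400-token default here is a deliberately conservative assumption):

```python
import re

def chunk_text(text, max_tokens=400):
    """Split text into chunks of at most max_tokens whitespace tokens,
    breaking on sentence boundaries. A single sentence longer than the
    budget is kept whole rather than split mid-sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk is then translated independently and the English outputs concatenated in order.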
## Usage
```python
from transformers import MarianTokenizer, MarianMTModel
model_name = "brendaogutu/sw-en-translation-v1"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
# Translate
swahili_text = "Habari za asubuhi. Ninaitwa Amina na nina miaka 14."
inputs = tokenizer(swahili_text, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, num_beams=5, max_length=256)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
```
## Training Details
**Base Model:** Helsinki-NLP/opus-mt-mul-en
**Training Epochs:** 8
**Batch Size:** 550 (effective batch size 550, i.e. no gradient accumulation)
**Learning Rate:** 3e-05
**Optimizer:** AdamW
**Hardware:** NVIDIA GPU with FP16 mixed precision
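The hyperparameters above can be expressed as a `transformers` training configuration. This is a reconstruction, not the original training script: the argument names follow the current `Seq2SeqTrainingArguments` API, and the `output_dir` and `optim` values are assumptions.

```python
from transformers import Seq2SeqTrainingArguments

# Hyperparameters as reported in the model card.
training_args = Seq2SeqTrainingArguments(
    output_dir="sw-en-translation-v1",  # assumed name
    num_train_epochs=8,
    per_device_train_batch_size=550,    # effective batch size 550
    learning_rate=3e-5,
    optim="adamw_torch",                # AdamW
    fp16=True,                          # mixed precision on NVIDIA GPUs
    predict_with_generate=True,
)
```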
## Evaluation Methodology
1. **Test Set:** Random 5% split from training distribution
2. **Domain Evaluation:** Held-out set of real helpline call transcriptions (10-min calls)
3. **Metrics:** BLEU, chrF, COMET-QE (quality estimation)
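In practice these scores come from standard tooling such as sacreBLEU and COMET; the toy sketch below implements only the core of BLEU (clipped n-gram precision with a brevity penalty) from scratch, to show what the reported number measures on its 0-1 scale. It is illustrative and will not match sacreBLEU's tokenization or smoothing exactly.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with uniform n-gram weights and a brevity penalty."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ngrams, r_ngrams = ngrams(h, n), ngrams(r, n)
            matches[n - 1] += sum((h_ngrams & r_ngrams).values())  # clipped counts
            totals[n - 1] += sum(h_ngrams.values())
    if min(matches) == 0:
        return 0.0  # any empty n-gram match level zeroes the (unsmoothed) score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)
```

An exact match scores 1.0; a hypothesis sharing no 4-grams with its reference scores 0.0, which is why short, heavily divergent domain outputs can collapse to the 0.0000 reported above.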
## Citation
```bibtex
@software{openchs_translation_2025,
  author    = {BITZ IT Consulting Ltd},
  title     = {Swahili-English Translation Model for OpenCHS},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/brendaogutu/sw-en-translation-v1}
}
```
## Contact
For questions or issues, please contact: brenda@openchs.org
---
*This model is part of the OpenCHS project supporting child helpline services across East Africa.*