---
language:
- sw
- en
license: apache-2.0
tags:
- translation
- swahili
- english
- opus-mt
- openchs
datasets:
- nllb
- ccaligned
metrics:
- bleu
- chrf
- comet
---

# Swahili-English Translation Model for Child Helpline Services

## Model Description

This model is a fine-tuned version of `Helsinki-NLP/opus-mt-mul-en` for Swahili-to-English translation, optimized for child helpline call transcriptions in East Africa (Tanzania, Uganda, Kenya).

**Developed by:** BITZ IT Consulting Ltd
**Project:** OpenCHS (Open Child Helpline System)
**Funded by:** UNICEF Venture Fund
**License:** Apache 2.0

## Training Data

The model was fine-tuned on a combination of:

- NLLB Swahili-English parallel corpus (high quality, weight: 1.0)
- CCAligned web-crawled parallel data (supplementary, weight: 0.7)
- Total training samples: approximately 50,000 sentence pairs
- Domain focus: conversational Swahili from helpline contexts

## Performance

### Test Set (General Translation)

- **BLEU:** 0.1277
- **chrF:** 32.30
- **Improvement over baseline:** +0.0%

### Domain Evaluation (Call Transcriptions)

- **Domain BLEU:** 0.0000
- **Domain chrF:** 1.86
- **Domain COMET-QE:** 0.0000

*Domain metrics are evaluated on real 10-minute call transcriptions from child helplines.*

## Intended Use

**Primary Use Case:** Translating Swahili helpline call transcriptions to English for:

- Case documentation and reporting
- Quality assurance and supervision
- Cross-border case referrals
- Data analysis and insights

**Languages:** Swahili (source) → English (target)

**Limitations:**

- Optimized for conversational Swahili (East African dialects)
- May not perform well on formal/literary Swahili
- Best for text lengths under 512 tokens
- Requires post-editing for critical use cases

## Usage

```python
from transformers import MarianTokenizer, MarianMTModel

model_name = "YOUR_USERNAME/brendaogutu/sw-en-translation-v1"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Translate a Swahili sentence.
# ("Good morning. My name is Amina and I am 14 years old.")
swahili_text = "Habari za asubuhi. Ninaitwa Amina na nina miaka 14."
inputs = tokenizer(swahili_text, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, num_beams=5, max_length=256)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
```

## Training Details

**Base Model:** Helsinki-NLP/opus-mt-mul-en
**Training Epochs:** 8
**Batch Size:** 550 (effective: 550)
**Learning Rate:** 3e-05
**Optimizer:** AdamW
**Hardware:** NVIDIA GPU with FP16 mixed precision

## Evaluation Methodology

1. **Test Set:** Random 5% split from the training distribution
2. **Domain Evaluation:** Held-out set of real helpline call transcriptions (10-minute calls)
3. **Metrics:** BLEU, chrF, COMET-QE (quality estimation)

## Citation

```bibtex
@software{openchs_translation_2025,
  author    = {BITZ IT Consulting Ltd},
  title     = {Swahili-English Translation Model for OpenCHS},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/YOUR_USERNAME/brendaogutu/sw-en-translation-v1}
}
```

## Contact

For questions or issues, please contact: brenda@openchs.org

---

*This model is part of the OpenCHS project supporting child helpline services across East Africa.*
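Because the model works best on inputs under 512 tokens (see Limitations), a 10-minute call transcription should be split into sentence-aligned chunks before translation and the chunk translations re-joined afterwards. Below is a minimal pre-processing sketch; the `split_transcript` helper and its word budget are illustrative assumptions, not part of the released tooling, and a word count is only a rough proxy for the tokenizer's actual token count:

```python
import re

def split_transcript(text, max_words=100):
    """Split a transcript into sentence-aligned chunks, each under a
    rough word budget, so every chunk stays well below the model's
    512-token input limit."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = len(sentence.split())
        # Flush the current chunk if adding this sentence would exceed the budget.
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks

# Each chunk can then be passed to tokenizer/model.generate independently.
transcript = "Habari za asubuhi. Ninaitwa Amina. Nina miaka kumi na nne."
for chunk in split_transcript(transcript, max_words=8):
    print(chunk)
```

For production use, counting tokens with `tokenizer(chunk)["input_ids"]` instead of whitespace words would track the real 512-token limit more faithfully.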