# Swahili-English Translation Model (General Domain Expansion)

This model is a fine-tuned version of [Helsinki-NLP/opus-mt-mul-en](https://huggingface.co/Helsinki-NLP/opus-mt-mul-en) on a large corpus of general Swahili-English translations, while preserving helpline translation quality.

## Model Details

- **Base Model:** Helsinki-NLP/opus-mt-mul-en
- **Language Pair:** Swahili (sw) → English (en)
- **Training Data:**
  - CCAligned general corpus (200k+ samples)
  - Helpline conversation data (oversampled 5x for domain retention)
- **Special Features:**
  - Domain-aware via domain tags for general and helpline input
  - Optimized for both general and helpline translations
  - Knowledge distillation from a helpline-specialized model

## Training Procedure

### Memory Optimizations

- CPU teacher offloading
- Gradient checkpointing
- Batch size: 8, gradient accumulation: 16

### Training Hyperparameters

- Learning rate: 1.5e-5
- Epochs: 1
- Optimizer: AdamW
- LR scheduler: cosine with warmup

An illustrative training-configuration sketch appears in the appendix near the end of this card.

## Performance

| Domain   | BLEU | chrF |
|----------|------|------|
| Helpline | X.XX | XX.X |
| General  | X.XX | XX.X |

*(Replace with actual metrics from training.)* A sketch showing how BLEU and chrF can be computed appears in the appendix near the end of this card.

## Usage

```python
from transformers import MarianMTModel, MarianTokenizer

# Load model and tokenizer
model_name = "brendaogutu/sw-en-opus-mt-general-expanded"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# For general translations
text = " Habari za asubuhi"
inputs = tokenizer(text, return_tensors="pt", padding=True)
outputs = model.generate(**inputs)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)  # "Good morning"

# For helpline translations
text = " Ninahitaji msaada wa haraka"
inputs = tokenizer(text, return_tensors="pt", padding=True)
outputs = model.generate(**inputs)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)  # "I need urgent help"
```

## Limitations

- Optimized for Swahili to English only (not bidirectional)
- Best performance when the appropriate domain tag (general or helpline) is included in the input
- May struggle with very technical or specialized vocabulary outside the training domains

## Training Details

- **Framework:** Transformers + PyTorch
- **Hardware:** Single GPU
- **Training Time:** ~X hours
- **Checkpoint Strategy:** Every 500 steps for power-failure recovery

## Citation

If you use this model, please cite:

```bibtex
@misc{sw-en-general-expanded,
  author = {Your Name/Organization},
  title = {Swahili-English General Domain Translation Model},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/brendaogutu/sw-en-opus-mt-general-expanded}
}
```

## License

This model inherits the license from Helsinki-NLP/opus-mt-mul-en.

## Contact

For questions or issues, please open an issue on the model repository.
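
## Appendix: Illustrative Training Configuration

The sketch below shows one way the memory optimizations and hyperparameters listed above could be wired into a Hugging Face `Seq2SeqTrainer` run. It is a minimal, hypothetical reconstruction, not the actual training script: the output directory, warmup ratio, and fp16 flag are assumptions, and dataset preparation, domain-tag injection, and the knowledge-distillation loss with the CPU-offloaded teacher are not shown.

```python
from transformers import (
    MarianMTModel,
    MarianTokenizer,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    DataCollatorForSeq2Seq,
)

base_model = "Helsinki-NLP/opus-mt-mul-en"
tokenizer = MarianTokenizer.from_pretrained(base_model)
model = MarianMTModel.from_pretrained(base_model)
model.gradient_checkpointing_enable()  # trade compute for memory, as listed above

training_args = Seq2SeqTrainingArguments(
    output_dir="sw-en-opus-mt-general-expanded",  # assumed output path
    per_device_train_batch_size=8,       # batch size 8
    gradient_accumulation_steps=16,      # effective batch size of 128
    learning_rate=1.5e-5,
    num_train_epochs=1,
    optim="adamw_torch",                 # AdamW
    lr_scheduler_type="cosine",          # cosine schedule with warmup
    warmup_ratio=0.05,                   # assumed warmup fraction (not stated in the card)
    save_steps=500,                      # checkpoint every 500 steps for recovery
    save_total_limit=2,
    predict_with_generate=True,
    fp16=True,                           # assumed; depends on the GPU used
    logging_steps=100,
)

# The trainer wiring is commented out because the tokenized dataset
# (CCAligned + oversampled helpline data) is not defined here.
# trainer = Seq2SeqTrainer(
#     model=model,
#     args=training_args,
#     train_dataset=train_dataset,
#     data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
# )
# trainer.train()
```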
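
## Appendix: Computing BLEU and chrF

The BLEU and chrF scores in the performance table can be computed with `sacrebleu`, for example as in the sketch below. This is not necessarily the exact evaluation script used for the reported numbers; the two-sentence evaluation set is purely illustrative and should be replaced with a held-out test split per domain, with domain tags applied as in the Usage section.

```python
import sacrebleu
from transformers import MarianMTModel, MarianTokenizer

model_name = "brendaogutu/sw-en-opus-mt-general-expanded"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tiny illustrative eval set; replace with a real held-out test split.
sources = ["Habari za asubuhi", "Ninahitaji msaada wa haraka"]
references = ["Good morning", "I need urgent help"]

# Translate the source sentences in a single batch.
inputs = tokenizer(sources, return_tensors="pt", padding=True)
outputs = model.generate(**inputs)
hypotheses = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Corpus-level BLEU and chrF against a single reference set.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.1f}")
```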