# small-100-Singlish-Sinhala-CodeMix
A code-mixed Singlish → Sinhala translation model built on top of SMaLL-100 (a distilled M2M-100), fine-tuned in two stages using LoRA. This model handles the everyday Sri Lankan reality of switching between English and Sinhala mid-sentence — something standard translation systems consistently fail at.
## What It Does

Given a code-mixed input like:

```text
mama hungry, kanna yamu
uber eken gihilla mall ekata yamu
meeting eka cancel una
```

It outputs clean Sinhala:

```text
මම බඩගිනි, කන්න යමු
උබර් එකෙන් ගිහිල්ලා මල් එකට යමු
මුණ ගැසීම අවලංගු වුණා
```

It also handles pure Singlish transliteration:

```text
mama gedara yanawa → මම ගෙදර යනවා
mata kaema one → මට කෑම ඕනේ
oyata kohomada → ඔයාට කොහොමද
```
## Training

### Stage 1 — Base Fine-tune

The starting point is savinugunarathna/Small100-Singlish-Sinhala-Merged, itself a LoRA fine-tune of SMaLL-100 trained on ~1M phonetic Singlish–Sinhala pairs plus an adhoc vocabulary set.

### Stage 2 — Code-Mix Fine-tune (this model)

LoRA was applied on top of the merged stage-1 weights and trained on a purpose-built code-mixed dataset, with replay samples from the stage-1 data to prevent catastrophic forgetting.
| Parameter | Value |
|---|---|
| Base model | savinugunarathna/Small100-Singlish-Sinhala-Merged |
| Method | LoRA (r=32, α=64, dropout=0.05) |
| LoRA targets | q_proj, k_proj, v_proj, out_proj, fc1, fc2 |
| Trainable params | ~5.5% of total |
| Epochs | 3 |
| Batch size | 8 × 4 grad accum = 32 effective |
| Learning rate | 3e-5 (cosine schedule, 5% warmup) |
| Hardware | NVIDIA P100 16GB |
| Precision | float32 (training) |
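Numerically, the LoRA setup above replaces full-weight updates with a trainable low-rank delta per targeted matrix. A minimal NumPy sketch of the idea, using the r=32, α=64 values from the table (the matrix dimensions here are illustrative, not the model's actual sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, r, alpha = 512, 32, 64  # illustrative d_model; r and alpha as in the table

W = rng.standard_normal((d_model, d_model))   # frozen base weight (e.g. q_proj)
A = rng.standard_normal((r, d_model)) * 0.01  # trainable down-projection
B = np.zeros((d_model, r))                    # trainable up-projection, zero-initialised

delta = (alpha / r) * (B @ A)  # low-rank update, scaled by alpha/r = 2.0
W_merged = W + delta           # "merging" folds the adapter into the base weight

# With B zero-initialised, the adapter starts as a no-op:
assert np.allclose(W_merged, W)
```

Per adapted matrix, only A and B are trained (2·d·r values versus d² for the full weight), which is how the trainable fraction stays small.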
## Data Composition
| Source | Samples | Role |
|---|---|---|
| Code-mixed pairs | full set × 2 upsample | Primary target |
| Phonetic Singlish–Sinhala | 15,000 | Replay — prevents forgetting |
| Adhoc vocabulary | 5,000 | Replay — preserves edge cases |
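A sketch of how such a replay mix could be assembled; the function name and the toy pairs below are hypothetical illustrations, not the released training code:

```python
import random

def build_training_mix(code_mixed, phonetic, adhoc, seed=42):
    """Assemble the stage-2 mix: code-mixed pairs upsampled 2x as the
    primary target, plus fixed-size replay samples of stage-1 data
    to guard against catastrophic forgetting."""
    rng = random.Random(seed)
    mix = code_mixed * 2  # 2x upsample of the primary code-mixed data
    mix += rng.sample(phonetic, min(15_000, len(phonetic)))  # phonetic replay
    mix += rng.sample(adhoc, min(5_000, len(adhoc)))         # adhoc replay
    rng.shuffle(mix)
    return mix

# Toy example with placeholder (source, target) pairs:
mix = build_training_mix(
    code_mixed=[("mama hungry", "මම බඩගිනි")] * 3,
    phonetic=[("mama gedara yanawa", "මම ගෙදර යනවා")] * 10,
    adhoc=[("uber", "උබර්")] * 10,
)
print(len(mix))  # → 26 (3x2 upsampled + 10 replay + 10 replay)
```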
## Evaluation
Final evaluation on held-out test sets after merge:
| Test Set | CER ↓ | WER ↓ | Exact Match ↑ |
|---|---|---|---|
| Phonetic | 0.0211 | 0.1015 | 31.0% |
| Adhoc | 0.0458 | 0.1583 | 25.5% |
A CER of 0.021 on the phonetic set means roughly 98% of output characters match the reference — the remaining errors are mostly minor orthographic variants (e.g. මහලු vs මහළු) rather than meaning-altering mistakes.
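For reference, CER and WER are edit distance normalised by reference length, at character and word level respectively. A minimal self-contained implementation (not the exact evaluation script behind the table above):

```python
def levenshtein(a, b):
    """Edit distance between two sequences (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: char edits / reference length."""
    return levenshtein(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    """Word error rate: word edits / reference word count."""
    r, h = ref.split(), hyp.split()
    return levenshtein(r, h) / max(len(r), 1)

# The orthographic variant from above differs in a single letter (ල vs ළ):
print(round(cer("මහලු", "මහළු"), 2))  # → 0.25 (1 edit over 4 characters)
```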
## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "savinugunarathna/small-100-Singlish-Sinhala-CodeMix"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

tokenizer.src_lang = "en"
tgt_lang_id = tokenizer.lang_code_to_id["si"]


def translate(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    output = model.generate(
        **inputs,
        forced_bos_token_id=tgt_lang_id,  # force Sinhala as the target language
        num_beams=3,
        max_new_tokens=128,
        repetition_penalty=1.2,
    )
    decoded = tokenizer.decode(output[0], skip_special_tokens=True).strip()
    # Strip the target-language token if the tokenizer leaves it in the output
    if decoded.startswith("__si__"):
        decoded = decoded[len("__si__"):].strip()
    return decoded


print(translate("mama hungry, kanna yamu"))
# → මම බඩගිනි, කන්න යමු
```
## Limitations
- Trained primarily on text-based Singlish. Heavy slang, regional dialect variation, and novel English loanwords may degrade output quality.
- Punctuation handling in the adhoc domain is inconsistent — the model sometimes drops commas present in reference translations.
- Not evaluated on formal or written Sinhala; this model is optimised for conversational register.
## Model Family

| Model | Description |
|---|---|
| alirezamsh/small100 | Base multilingual model |
| savinugunarathna/Small100-Singlish-Sinhala-Merged | Stage 1 — phonetic + adhoc fine-tune |
| savinugunarathna/small-100-Singlish-Sinhala-CodeMix | Stage 2 — this model, code-mix specialisation |