small-100-Singlish-Sinhala-CodeMix

A code-mixed Singlish → Sinhala translation model built on top of SMaLL-100 (a distilled M2M-100), fine-tuned in two stages using LoRA. This model handles the everyday Sri Lankan reality of switching between English and Sinhala mid-sentence, something standard translation systems consistently fail at.


What It Does

Given a code-mixed input like:

```
mama hungry, kanna yamu
uber eken gihilla mall ekata yamu
meeting eka cancel una
```

it outputs clean Sinhala:

```
මම බඩගිනි, කන්න යමු
උබර් එකෙන් ගිහිල්ලා මල් එකට යමු
මුණ ගැසීම අවලංගු වුණා
```

It also handles pure Singlish transliteration:

```
mama gedara yanawa   →   මම ගෙදර යනවා
mata kaema one       →   මට කෑම ඕනේ
oyata kohomada       →   ඔයාට කොහොමද
```

Training

Stage 1 — Base Fine-tune

The starting point is savinugunarathna/Small100-Singlish-Sinhala-Merged, itself a LoRA fine-tune of SMaLL-100 trained on ~1M phonetic Singlish–Sinhala pairs plus an adhoc vocabulary set.

Stage 2 — Code-Mix Fine-tune (this model)

LoRA was applied on top of the merged stage-1 weights and trained on a purpose-built code-mixed dataset, with replay of stage-1 data to prevent catastrophic forgetting.

| Parameter | Value |
|---|---|
| Base model | savinugunarathna/Small100-Singlish-Sinhala-Merged |
| Method | LoRA (r=32, α=64, dropout=0.05) |
| LoRA targets | q_proj, k_proj, v_proj, out_proj, fc1, fc2 |
| Trainable params | ~5.5% of total |
| Batch size | 8 × 4 grad accum = 32 effective |
| Epochs | 3 |
| Learning rate | 3e-5 (cosine schedule, 5% warmup) |
| Hardware | NVIDIA P100 16GB |
| Precision | float32 (training) |
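
For reference, the table above maps onto a configuration like the following. This is an illustrative sketch; the field names mirror `peft.LoraConfig` and `transformers.TrainingArguments`, but it is not the actual training script:

```python
# Illustrative reconstruction of the stage-2 hyperparameters from the table.
# Field names follow peft.LoraConfig / transformers.TrainingArguments conventions.
lora = {
    "r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "out_proj", "fc1", "fc2"],
}

training = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 4,
    "learning_rate": 3e-5,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.05,
}

# Effective batch size = per-device batch × gradient accumulation steps.
effective_batch = (
    training["per_device_train_batch_size"]
    * training["gradient_accumulation_steps"]
)
print(effective_batch)  # → 32
```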

Data Composition

| Source | Samples | Role |
|---|---|---|
| Code-mixed pairs | full set × 2 upsample | Primary target |
| Phonetic Singlish–Sinhala | 15,000 | Replay — prevents forgetting |
| Adhoc vocabulary | 5,000 | Replay — preserves edge cases |
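
A minimal sketch of how such a replay mix could be assembled. The function and its name are illustrative; only the sample sizes and the ×2 upsample follow the table above:

```python
import random

def build_training_mix(codemix, phonetic, adhoc, seed=42):
    """Assemble a replay-style training mix: the code-mixed set (the primary
    target) is upsampled 2x, and fixed-size random samples of the stage-1
    data are replayed alongside it to limit catastrophic forgetting."""
    rng = random.Random(seed)
    mix = codemix * 2                                         # full set, ×2 upsample
    mix += rng.sample(phonetic, min(15_000, len(phonetic)))   # replay: phonetic pairs
    mix += rng.sample(adhoc, min(5_000, len(adhoc)))          # replay: adhoc vocabulary
    rng.shuffle(mix)
    return mix
```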

Evaluation

Final evaluation on held-out test sets, after merging the LoRA adapter into the base weights:

| Test Set | CER ↓ | WER ↓ | Exact Match ↑ |
|---|---|---|---|
| Phonetic | 0.0211 | 0.1015 | 31.0% |
| Adhoc | 0.0458 | 0.1583 | 25.5% |

A CER of 0.0211 on the phonetic set means the model reproduces the target character sequence correctly roughly 98% of the time; the remaining errors are mostly minor orthographic variants (e.g. මහලු vs මහළු) rather than meaning-altering mistakes.
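
CER and WER here are standard Levenshtein-based error rates; a minimal self-contained sketch (not the original evaluation script) shows how they are computed:

```python
def levenshtein(a, b):
    """Edit distance between two sequences, via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ea in enumerate(a, 1):
        cur = [i]
        for j, eb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ea != eb)))    # substitution
        prev = cur
    return prev[-1]

def cer(hyp, ref):
    """Character error rate: character-level edit distance / reference length."""
    return levenshtein(hyp, ref) / max(len(ref), 1)

def wer(hyp, ref):
    """Word error rate: edit distance over whitespace-split tokens."""
    h, r = hyp.split(), ref.split()
    return levenshtein(h, r) / max(len(r), 1)
```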


Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id  = "savinugunarathna/small-100-Singlish-Sinhala-CodeMix"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model     = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Singlish input is romanised, so it is tagged as English; the target is Sinhala.
tokenizer.src_lang = "en"
tgt_lang_id = tokenizer.lang_code_to_id["si"]

def translate(text: str) -> str:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    output = model.generate(
        **inputs,
        forced_bos_token_id=tgt_lang_id,  # force Sinhala as the output language
        num_beams=3,
        max_new_tokens=128,
        repetition_penalty=1.2,
    )
    decoded = tokenizer.decode(output[0], skip_special_tokens=True).strip()
    # The target-language tag can survive decoding; strip it if present.
    if decoded.startswith("__si__"):
        decoded = decoded[len("__si__"):].strip()
    return decoded

print(translate("mama hungry, kanna yamu"))
# → මම බඩගිනි, කන්න යමු
```

Limitations

  • Trained primarily on text-based Singlish. Heavy slang, regional dialect variation, and novel English loanwords may degrade output quality.
  • Punctuation handling in the adhoc domain is inconsistent — the model sometimes drops commas present in reference translations.
  • Not evaluated on formal or written Sinhala; this model is optimised for conversational register.

Model Family

| Model | Description |
|---|---|
| alirezamsh/small100 | Base multilingual model |
| savinugunarathna/Small100-Singlish-Sinhala-Merged | Stage 1 — phonetic + adhoc fine-tune |
| savinugunarathna/small-100-Singlish-Sinhala-CodeMix | Stage 2 — this model, code-mix specialisation |