# small-100-Singlish-Sinhala-CodeMix
A code-mixed Singlish → Sinhala translation model built on top of SMaLL-100 (a distilled M2M-100), fine-tuned in two stages using LoRA. This model handles the everyday Sri Lankan reality of switching between English and Sinhala mid-sentence — something standard translation systems consistently fail at.
## What It Does

Given a code-mixed input like:

```text
mama hungry, kanna yamu
uber eken gihilla mall ekata yamu
meeting eka cancel una
```

It outputs clean Sinhala:

```text
මම බඩගිනි, කන්න යමු
උබර් එකෙන් ගිහිල්ලා මල් එකට යමු
මුණ ගැසීම අවලංගු වුණා
```

It also handles pure Singlish transliteration:

```text
mama gedara yanawa → මම ගෙදර යනවා
mata kaema one → මට කෑම ඕනේ
oyata kohomada → ඔයාට කොහොමද
```
## Training

### Stage 1 — Base Fine-tune

The starting point is savinugunarathna/Small100-Singlish-Sinhala-Merged, itself a LoRA fine-tune of SMaLL-100 trained on ~1M phonetic Singlish–Sinhala pairs plus an adhoc vocabulary set.

### Stage 2 — Code-Mix Fine-tune (this model)

LoRA was applied on top of the merged stage-1 weights and trained on a purpose-built code-mixed dataset, with replay samples from the stage-1 data to prevent catastrophic forgetting.
| Parameter | Value |
|---|---|
| Base model | savinugunarathna/Small100-Singlish-Sinhala-Merged |
| Method | LoRA (r=32, α=64, dropout=0.05) |
| LoRA targets | q_proj, k_proj, v_proj, out_proj, fc1, fc2 |
| Trainable params | ~5.5% of total |
| Epochs | 3 |
| Batch size | 8 × 4 grad accum = 32 effective |
| Learning rate | 3e-5 (cosine schedule, 5% warmup) |
| Hardware | NVIDIA P100 16GB |
| Precision | float32 (training) |
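Numerically, the LoRA setup above replaces full-weight updates with a trainable low-rank delta per targeted matrix. A minimal NumPy sketch of the idea, using the r=32, α=64 values from the table (the matrix dimensions here are illustrative, not the model's actual sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, r, alpha = 512, 32, 64  # illustrative d_model; r and alpha as in the table

W = rng.standard_normal((d_model, d_model))   # frozen base weight (e.g. q_proj)
A = rng.standard_normal((r, d_model)) * 0.01  # trainable down-projection
B = np.zeros((d_model, r))                    # trainable up-projection, zero-initialised

delta = (alpha / r) * (B @ A)  # low-rank update, scaled by alpha/r = 2.0
W_merged = W + delta           # "merging" folds the adapter into the base weight

# With B zero-initialised, the adapter starts as a no-op:
assert np.allclose(W_merged, W)
```

Per adapted matrix, only A and B are trained (2·d·r values versus d² for the full weight), which is how the trainable fraction stays small.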
## Data Composition
| Source | Samples | Role |
|---|---|---|
| Code-mixed pairs | full set × 2 upsample | Primary target |
| Phonetic Singlish–Sinhala | 15,000 | Replay — prevents forgetting |
| Adhoc vocabulary | 5,000 | Replay — preserves edge cases |
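A sketch of how such a replay mix could be assembled; the function name and the toy pairs below are hypothetical illustrations, not the released training code:

```python
import random

def build_training_mix(code_mixed, phonetic, adhoc, seed=42):
    """Assemble the stage-2 mix: code-mixed pairs upsampled 2x as the
    primary target, plus fixed-size replay samples of stage-1 data
    to guard against catastrophic forgetting."""
    rng = random.Random(seed)
    mix = code_mixed * 2  # 2x upsample of the primary code-mixed data
    mix += rng.sample(phonetic, min(15_000, len(phonetic)))  # phonetic replay
    mix += rng.sample(adhoc, min(5_000, len(adhoc)))         # adhoc replay
    rng.shuffle(mix)
    return mix

# Toy example with placeholder (source, target) pairs:
mix = build_training_mix(
    code_mixed=[("mama hungry", "මම බඩගිනි")] * 3,
    phonetic=[("mama gedara yanawa", "මම ගෙදර යනවා")] * 10,
    adhoc=[("uber", "උබර්")] * 10,
)
print(len(mix))  # → 26 (3x2 upsampled + 10 replay + 10 replay)
```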
## Evaluation
Final evaluation on held-out test sets after merge:
| Test Set | CER ↓ | WER ↓ | Exact Match ↑ |
|---|---|---|---|
| Phonetic | 0.0211 | 0.1015 | 31.0% |
| Adhoc | 0.0458 | 0.1583 | 25.5% |
A CER of 0.021 on the phonetic set means roughly 98% of output characters match the reference — the remaining errors are mostly minor orthographic variants (e.g. මහලු vs මහළු) rather than meaning-altering mistakes.
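For reference, CER and WER are edit distance normalised by reference length, at character and word level respectively. A minimal self-contained implementation (not the exact evaluation script behind the table above):

```python
def levenshtein(a, b):
    """Edit distance between two sequences (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: char edits / reference length."""
    return levenshtein(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    """Word error rate: word edits / reference word count."""
    r, h = ref.split(), hyp.split()
    return levenshtein(r, h) / max(len(r), 1)

# The orthographic variant from above differs in a single letter (ල vs ළ):
print(round(cer("මහලු", "මහළු"), 2))  # → 0.25 (1 edit over 4 characters)
```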
## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "savinugunarathna/small-100-Singlish-Sinhala-CodeMix"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

tokenizer.src_lang = "en"
tgt_lang_id = tokenizer.lang_code_to_id["si"]


def translate(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    output = model.generate(
        **inputs,
        forced_bos_token_id=tgt_lang_id,  # force Sinhala as the target language
        num_beams=3,
        max_new_tokens=128,
        repetition_penalty=1.2,
    )
    decoded = tokenizer.decode(output[0], skip_special_tokens=True).strip()
    # Strip the target-language token if the tokenizer leaves it in the output
    if decoded.startswith("__si__"):
        decoded = decoded[len("__si__"):].strip()
    return decoded


print(translate("mama hungry, kanna yamu"))
# → මම බඩගිනි, කන්න යමු
```
## Limitations
- Trained primarily on text-based Singlish. Heavy slang, regional dialect variation, and novel English loanwords may degrade output quality.
- Punctuation handling in the adhoc domain is inconsistent — the model sometimes drops commas present in reference translations.
- Not evaluated on formal or written Sinhala; this model is optimised for conversational register.
## Model Family

| Model | Description |
|---|---|
| alirezamsh/small100 | Base multilingual model |
| savinugunarathna/Small100-Singlish-Sinhala-Merged | Stage 1 — phonetic + adhoc fine-tune |
| savinugunarathna/small-100-Singlish-Sinhala-CodeMix | Stage 2 — this model, code-mix specialisation |