# Gemma3-Singlish-Sinhala-CodeMix

A fine-tuned Gemma-3-270M-IT model for Singlish → Sinhala transliteration and code-mixed text translation.

The model handles both:

- Pure Singlish: Roman-script Sinhala → Sinhala Unicode
- Code-mixed text: English-Sinhala mixed text → pure Sinhala (English words translated to Sinhala)
## Examples

| Input | Output | Task |
|---|---|---|
| ara politician wa mama dannawa | අර දේශපාලකයෝ මම දන්නවා | Code-mix |
| meka harima boring panthiyak | මේක හරිම කම්මැලි පන්තියක් | Code-mix |
| oyage phone eka ko | ඔයාගේ දුරකථනය කෝ | Code-mix |
| mama gedara yanawa | මම ගෙදර යනවා | Transliteration |
| oyata kohomada | ඔයාට කොහොමද | Transliteration |
## Evaluation Results

### Phonetic Test Set

| Metric | Score |
|---|---|
| CER | 0.0191 |
| WER | 0.0876 |
| Exact Match Accuracy | 41.01% |

### Adhoc Test Set

| Metric | Score |
|---|---|
| CER | 0.0416 |
| WER | 0.1484 |
| Exact Match Accuracy | 22.17% |
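CER and WER are edit-distance metrics: edits between prediction and reference, divided by reference length (in characters or words). The sketch below is a minimal pure-Python illustration of how these metrics are typically computed, not the exact evaluation script used for this card:

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance
    # (substitutions, insertions, deletions all cost 1)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    # character error rate: edits over reference length
    return levenshtein(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    # word error rate: same distance, computed over whitespace tokens
    r, h = ref.split(), hyp.split()
    return levenshtein(r, h) / max(len(r), 1)

def exact_match(refs, hyps):
    # fraction of predictions identical to their reference
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)
```

Corpus-level scores are usually length-weighted (total edits over total reference length) rather than averaged per sentence, which is worth keeping in mind when comparing numbers.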
## Sample Predictions

### Phonetic Test

|  |  |
|---|---|
| Input | awankawama mata eya mathaka ethi akaraya eyayi namuth eya wikarayaki mama obata ashwaya kerehi wedi elmak nodakwana namuth eya mage wilasithawa nowe |
| Target | අවංකවම මට එය මතක ඇති ආකාරය එයයි නමුත් එය විකාරයකි මම ඔබට අශ්වයා කෙරෙහි වැඩි ඇල්මක් නොදක්වන නමුත් එය මගේ විලාසිතාව නොවේ |
| Pred | අවංකවම මට එය මතක ඇති ආකාරය එයයි නමුත් එය විකාරයකි මම ඔබට අශ්වයා කෙරෙහි වැඩි ඇල්මක් නොදක්වන නමුත් එය මගේ විලාසිතාව නොවේ ✓ |
| Input | oba mage aneka yeyi adahas karanne kese ho oba ema wilasithawata andinne mandeyi mama asami |
| Target | ඔබ මගේ අනෙකා යැයි අදහස් කරන්නේ කෙසේ හෝ ඔබ එම විලාසිතාවට අඳින්නේ මන්දැයි මම අසමි |
| Pred | ඔබ මගේ අනෙකා යැයි අදහස් කරන්නේ කෙසේ හෝ ඔබ එම විලාසිතාවට අඳින්නේ මන්දැයි මම අසමි ✓ |

### Adhoc Test

|  |  |
|---|---|
| Input | kmk nehe modyi wge |
| Target | කමක් නැහැ මෝඩයි වගේ |
| Pred | කමක් නැහැ මෝඩයි වගේ ✓ |
| Input | eya wda honda deyi asanna |
| Target | එය වඩා හොඳ දැයි අසන්න |
| Pred | එය වඩා හොඳ දැයි අසන්න ✓ |
### Code-Mix Translation

| Input | Prediction |
|---|---|
| ara politician wa mama dannawa | අර දේශපාලකයෝ මම දන්නවා |
| meka harima boring panthiyak | මේක හරිම කම්මැලි පන්තියක් |
| oyage phone eka ko | ඔයාගේ දුරකථනය කෝ |
## Training

### Architecture & Method

- Base model: google/gemma-3-270m-it (273M parameters)
- Method: Multi-phase LoRA fine-tuning
- Saved precision: bfloat16
### Phase 1–3: Transliteration Foundation

The base transliteration model (Gemma3-Singlish-Sinhala-Merged) was trained in three progressive phases on the Swa-bhasha phonetic dataset:

| Phase | Data | Epochs | Learning Rate | LoRA Config |
|---|---|---|---|---|
| Phase 1: Foundation | 650K phonetic samples + augmentation | 2 | 1e-4 | r=64, α=128 |
| Phase 2: Expansion | 350K new + adhoc + replay + augmentation | 2 | 5e-5 | r=64, α=128 |
| Phase 3: Mastery | Adhoc-heavy + phonetic mix + augmentation | 2 | 2e-5 | r=64, α=128 |

Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`

After each phase, the LoRA adapter was merged into the base model.
### Phase 4: Code-Mix Fine-tuning

A new LoRA adapter was trained on top of the merged Phase 3 model to add code-mix translation capability:

| Parameter | Value |
|---|---|
| Training data | ~57K samples (77% code-mix, 23% replay) |
| Code-mix data | ~27K unique samples × 2 upsample |
| Replay data | 10K phonetic + 3K adhoc (anti-forgetting) |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| LoRA dropout | 0.05 |
| Target modules | q, k, v, o, gate, up, down projections |
| Learning rate | 3e-5 (cosine decay) |
| Effective batch size | 32 |
| Epochs | 3 (with early stopping, patience=5) |
| Training precision | float32 (P100 stability) |
| Saved precision | bfloat16 |
| Hardware | NVIDIA Tesla P100 16GB |
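The adapter hyperparameters in the table map directly onto a `peft` `LoraConfig`. This is a hedged configuration sketch assuming the Hugging Face `peft` library, not the author's actual training script:

```python
from peft import LoraConfig

# Phase 4 adapter settings as reported in the table above
lora_cfg = LoraConfig(
    r=32,                 # small rank to limit adapter capacity
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```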
### Anti-Forgetting Strategy

- Replay buffer: 13K transliteration samples mixed into code-mix training
- Small LoRA rank (r=32): limits capacity to prevent overwriting base knowledge
- Low learning rate (3e-5): gentle updates to existing weights
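The replay strategy amounts to upsampling the new-task data, appending the replay buffer of old-task samples, and shuffling before training. A toy stdlib sketch of that mixing step (function name and counts are illustrative, not from the training code):

```python
import random

def build_training_mix(codemix, phonetic_replay, adhoc_replay,
                       upsample=2, seed=42):
    # duplicate the code-mix set, append the replay buffer, then shuffle
    # so old-task samples are interleaved throughout every epoch
    mix = codemix * upsample + phonetic_replay + adhoc_replay
    random.Random(seed).shuffle(mix)
    return mix

# toy illustration with placeholder samples
mix = build_training_mix(["cm1", "cm2", "cm3"], ["ph1"], ["ad1"])
```

Interleaving (rather than training on replay data in a separate pass) keeps every gradient batch a blend of old and new tasks, which is what discourages catastrophic forgetting.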
## Usage

```python
import warnings

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

warnings.filterwarnings("ignore")

MODEL_ID = "savinugunarathna/Gemma3-Singlish-Sinhala-CodeMix"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda" if torch.cuda.is_available() else "cpu")
model.eval()

PROMPT = "Translate the following code-mixed text into pure Sinhala:\n{input}\nSinhala:"

gen_config = GenerationConfig(
    max_new_tokens=256,
    num_beams=3,
    do_sample=False,
    repetition_penalty=1.2,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    top_p=None,
    top_k=None,
)

def convert(text):
    prompt = PROMPT.format(input=text)
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=True).to(model.device)
    in_len = inputs["input_ids"].shape[1]
    with torch.no_grad():
        out = model.generate(**inputs, generation_config=gen_config)
    # decode only the newly generated tokens, not the prompt
    return tokenizer.decode(out[0, in_len:], skip_special_tokens=True).strip()

print(convert("mama gedara yanawa"))
print(convert("ara politician wa mama dannawa"))
print(convert("meka harima boring panthiyak"))
print(convert("oyage phone eka ko"))
```
## Prompt Format

A single unified prompt is used for both tasks:

```
Transliterate the following Singlish text to Sinhala by considering the context:
{input}
Sinhala:
```
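Because the model is trained to continue after the `Sinhala:` marker, downstream code only needs to format the template and strip the prompt from the decoded output. A minimal sketch (helper names are illustrative, not part of the released model):

```python
PROMPT = ("Transliterate the following Singlish text to Sinhala "
          "by considering the context:\n{input}\nSinhala:")

def build_prompt(text: str) -> str:
    # fill the {input} slot of the unified template
    return PROMPT.format(input=text)

def extract_output(decoded: str, prompt: str) -> str:
    # keep only the model's continuation after the prompt text
    if decoded.startswith(prompt):
        return decoded[len(prompt):].strip()
    return decoded.strip()
```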
## Limitations

- Conjunct consonants: occasionally produces ්ය (without ZWJ) instead of ්‍ය (with ZWJ)
- Punctuation: may drop or alter commas and periods
- Rare English words: very uncommon English words in code-mixed input may be transliterated phonetically instead of translated
- Long sequences: quality may degrade for inputs longer than ~200 tokens
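The ZWJ issue above can be patched in post-processing: Sinhala conjuncts like the yansaya are encoded as al-lakuna (U+0DCA) + zero-width joiner (U+200D) + yayanna (U+0DBA), and the model sometimes omits the joiner. This is a hedged normalization sketch, not part of the released model:

```python
ZWJ = "\u200d"        # zero-width joiner
AL_LAKUNA = "\u0dca"  # Sinhala al-lakuna (virama)
YAYANNA = "\u0dba"    # Sinhala letter yayanna

def normalize_yansaya(text: str, use_zwj: bool = True) -> str:
    # normalize the yansaya conjunct to one consistent encoding
    plain = AL_LAKUNA + YAYANNA          # ්ය without joiner
    joined = AL_LAKUNA + ZWJ + YAYANNA   # ්‍ය with joiner
    without = text.replace(joined, plain)  # strip existing joiners first
    return without.replace(plain, joined) if use_zwj else without
```

The same pattern extends to other ZWJ conjuncts (e.g. the rakaransaya with U+0DBB) if they turn out to be affected.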
## Model Details

| Property | Value |
|---|---|
| Parameters | 273M |
| Precision | bfloat16 |
| Model size | ~536 MB |
| Context length | 256 tokens |
| Languages | Sinhala (si), English (en) |
| License | Apache 2.0 |
| Base model | google/gemma-3-270m-it |
## Citation

If you use this model, please cite:

```bibtex
@misc{gunarathna2025codemix,
  title={Gemma3-Singlish-Sinhala-CodeMix: Multi-Phase LoRA Fine-Tuning for Singlish-to-Sinhala Transliteration and Code-Mix Translation},
  author={Gunarathna, Savinu},
  year={2025},
  url={https://huggingface.co/savinugunarathna/Gemma3-Singlish-Sinhala-CodeMix}
}
```
## Related Work

```bibtex
@article{sumanathilaka2025swa,
  title={Swa-bhasha Resource Hub: Romanized Sinhala to Sinhala Transliteration Systems and Data Resources},
  author={Sumanathilaka, Deshan and Perera, Sameera and Dharmasiri, Sachithya and Athukorala, Maneesha and Herath, Anuja Dilrukshi and Dias, Rukshan and Gamage, Pasindu and Weerasinghe, Ruvan and Priyadarshana, YHPP},
  journal={arXiv preprint arXiv:2507.09245},
  year={2025}
}

@article{ranasinghe2022sold,
  title={SOLD: Sinhala Offensive Language Dataset},
  author={Ranasinghe, Tharindu and Anuradha, Isuri and Premasiri, Damith and Silva, Kanishka and Hettiarachchi, Hansi and Uyangodage, Lasitha and Zampieri, Marcos},
  journal={arXiv preprint arXiv:2212.00851},
  year={2022}
}

@inproceedings{Nsina2024,
  author={Hettiarachchi, Hansi and Premasiri, Damith and Uyangodage, Lasitha and Ranasinghe, Tharindu},
  title={{NSINA: A News Corpus for Sinhala}},
  booktitle={The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  year={2024},
  month={May}
}
```
## Acknowledgments

- Google for the Gemma model family
- Swa-bhasha Resource Hub for the phonetic transliteration dataset
- Hugging Face for the transformers ecosystem
- Training compute provided by Kaggle (P100 GPU)