# Gemma3-Singlish-Sinhala-CodeMix

A fine-tuned Gemma-3-270M-IT model for Singlish → Sinhala transliteration and code-mixed text translation.

The model handles both:

- Pure Singlish: Roman-script Sinhala → Sinhala Unicode
- Code-mixed text: English-Sinhala mixed text → pure Sinhala (English words translated to Sinhala)
## Examples

| Input | Output | Task |
|---|---|---|
| ara politician wa mama dannawa | අර දේශපාලකයෝ මම දන්නවා | Code-mix |
| meka harima boring panthiyak | මේක හරිම කම්මැලි පන්තියක් | Code-mix |
| oyage phone eka ko | ඔයාගේ දුරකථනය කෝ | Code-mix |
| mama gedara yanawa | මම ගෙදර යනවා | Transliteration |
| oyata kohomada | ඔයාට කොහොමද | Transliteration |
## Evaluation Results

### Phonetic Test Set

| Metric | Score |
|---|---|
| CER | 0.0191 |
| WER | 0.0876 |
| Exact Match Accuracy | 41.01% |

### Adhoc Test Set

| Metric | Score |
|---|---|
| CER | 0.0416 |
| WER | 0.1484 |
| Exact Match Accuracy | 22.17% |
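CER and WER are edit-distance metrics: edits between prediction and reference, divided by reference length (in characters or words). The sketch below is a minimal pure-Python illustration of how these metrics are typically computed, not the exact evaluation script used for this card:

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance
    # (substitutions, insertions, deletions all cost 1)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    # character error rate: edits over reference length
    return levenshtein(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    # word error rate: same distance, computed over whitespace tokens
    r, h = ref.split(), hyp.split()
    return levenshtein(r, h) / max(len(r), 1)

def exact_match(refs, hyps):
    # fraction of predictions identical to their reference
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)
```

Corpus-level scores are usually length-weighted (total edits over total reference length) rather than averaged per sentence, which is worth keeping in mind when comparing numbers.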
## Sample Predictions

### Phonetic Test

|  |  |
|---|---|
| Input | awankawama mata eya mathaka ethi akaraya eyayi namuth eya wikarayaki mama obata ashwaya kerehi wedi elmak nodakwana namuth eya mage wilasithawa nowe |
| Target | අවංකවම මට එය මතක ඇති ආකාරය එයයි නමුත් එය විකාරයකි මම ඔබට අශ්වයා කෙරෙහි වැඩි ඇල්මක් නොදක්වන නමුත් එය මගේ විලාසිතාව නොවේ |
| Pred | අවංකවම මට එය මතක ඇති ආකාරය එයයි නමුත් එය විකාරයකි මම ඔබට අශ්වයා කෙරෙහි වැඩි ඇල්මක් නොදක්වන නමුත් එය මගේ විලාසිතාව නොවේ ✓ |
| Input | oba mage aneka yeyi adahas karanne kese ho oba ema wilasithawata andinne mandeyi mama asami |
| Target | ඔබ මගේ අනෙකා යැයි අදහස් කරන්නේ කෙසේ හෝ ඔබ එම විලාසිතාවට අඳින්නේ මන්දැයි මම අසමි |
| Pred | ඔබ මගේ අනෙකා යැයි අදහස් කරන්නේ කෙසේ හෝ ඔබ එම විලාසිතාවට අඳින්නේ මන්දැයි මම අසමි ✓ |

### Adhoc Test

|  |  |
|---|---|
| Input | kmk nehe modyi wge |
| Target | කමක් නැහැ මෝඩයි වගේ |
| Pred | කමක් නැහැ මෝඩයි වගේ ✓ |
| Input | eya wda honda deyi asanna |
| Target | එය වඩා හොඳ දැයි අසන්න |
| Pred | එය වඩා හොඳ දැයි අසන්න ✓ |
### Code-Mix Translation

| Input | Prediction |
|---|---|
| ara politician wa mama dannawa | අර දේශපාලකයෝ මම දන්නවා |
| meka harima boring panthiyak | මේක හරිම කම්මැලි පන්තියක් |
| oyage phone eka ko | ඔයාගේ දුරකථනය කෝ |
## Training

### Architecture & Method

- Base model: google/gemma-3-270m-it (273M parameters)
- Method: Multi-phase LoRA fine-tuning
- Saved precision: bfloat16
### Phase 1–3: Transliteration Foundation

The base transliteration model (Gemma3-Singlish-Sinhala-Merged) was trained in three progressive phases on the Swa-bhasha phonetic dataset:

| Phase | Data | Epochs | Learning Rate | LoRA Config |
|---|---|---|---|---|
| Phase 1: Foundation | 650K phonetic samples + augmentation | 2 | 1e-4 | r=64, α=128 |
| Phase 2: Expansion | 350K new + adhoc + replay + augmentation | 2 | 5e-5 | r=64, α=128 |
| Phase 3: Mastery | Adhoc-heavy + phonetic mix + augmentation | 2 | 2e-5 | r=64, α=128 |

Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`

After each phase, the LoRA adapter was merged into the base model.
### Phase 4: Code-Mix Fine-tuning

A new LoRA adapter was trained on top of the merged Phase 3 model to add code-mix translation capability:

| Parameter | Value |
|---|---|
| Training data | ~57K samples (77% code-mix, 23% replay) |
| Code-mix data | ~27K unique samples × 2 upsample |
| Replay data | 10K phonetic + 3K adhoc (anti-forgetting) |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| LoRA dropout | 0.05 |
| Target modules | q, k, v, o, gate, up, down projections |
| Learning rate | 3e-5 (cosine decay) |
| Effective batch size | 32 |
| Epochs | 3 (with early stopping, patience=5) |
| Training precision | float32 (P100 stability) |
| Saved precision | bfloat16 |
| Hardware | NVIDIA Tesla P100 16GB |
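The adapter hyperparameters in the table map directly onto a `peft` `LoraConfig`. This is a hedged configuration sketch assuming the Hugging Face `peft` library, not the author's actual training script:

```python
from peft import LoraConfig

# Phase 4 adapter settings as reported in the table above
lora_cfg = LoraConfig(
    r=32,                 # small rank to limit adapter capacity
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```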
### Anti-Forgetting Strategy

- Replay buffer: 13K transliteration samples mixed into code-mix training
- Small LoRA rank (r=32): limits capacity to prevent overwriting base knowledge
- Low learning rate (3e-5): gentle updates to existing weights
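The replay strategy amounts to upsampling the new-task data, appending the replay buffer of old-task samples, and shuffling before training. A toy stdlib sketch of that mixing step (function name and counts are illustrative, not from the training code):

```python
import random

def build_training_mix(codemix, phonetic_replay, adhoc_replay,
                       upsample=2, seed=42):
    # duplicate the code-mix set, append the replay buffer, then shuffle
    # so old-task samples are interleaved throughout every epoch
    mix = codemix * upsample + phonetic_replay + adhoc_replay
    random.Random(seed).shuffle(mix)
    return mix

# toy illustration with placeholder samples
mix = build_training_mix(["cm1", "cm2", "cm3"], ["ph1"], ["ad1"])
```

Interleaving (rather than training on replay data in a separate pass) keeps every gradient batch a blend of old and new tasks, which is what discourages catastrophic forgetting.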
## Usage

```python
import warnings

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

warnings.filterwarnings("ignore")

MODEL_ID = "savinugunarathna/Gemma3-Singlish-Sinhala-CodeMix"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda" if torch.cuda.is_available() else "cpu")
model.eval()

PROMPT = "Translate the following code-mixed text into pure Sinhala:\n{input}\nSinhala:"

gen_config = GenerationConfig(
    max_new_tokens=256,
    num_beams=3,
    do_sample=False,
    repetition_penalty=1.2,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    top_p=None,
    top_k=None,
)

def convert(text):
    prompt = PROMPT.format(input=text)
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=True).to(model.device)
    in_len = inputs["input_ids"].shape[1]
    with torch.no_grad():
        out = model.generate(**inputs, generation_config=gen_config)
    # decode only the newly generated tokens, not the prompt
    return tokenizer.decode(out[0, in_len:], skip_special_tokens=True).strip()

print(convert("mama gedara yanawa"))
print(convert("ara politician wa mama dannawa"))
print(convert("meka harima boring panthiyak"))
print(convert("oyage phone eka ko"))
```
## Prompt Format

A single unified prompt is used for both tasks:

```
Transliterate the following Singlish text to Sinhala by considering the context:
{input}
Sinhala:
```
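Because the model is trained to continue after the `Sinhala:` marker, downstream code only needs to format the template and strip the prompt from the decoded output. A minimal sketch (helper names are illustrative, not part of the released model):

```python
PROMPT = ("Transliterate the following Singlish text to Sinhala "
          "by considering the context:\n{input}\nSinhala:")

def build_prompt(text: str) -> str:
    # fill the {input} slot of the unified template
    return PROMPT.format(input=text)

def extract_output(decoded: str, prompt: str) -> str:
    # keep only the model's continuation after the prompt text
    if decoded.startswith(prompt):
        return decoded[len(prompt):].strip()
    return decoded.strip()
```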
## Limitations

- Conjunct consonants: occasionally produces ්ය (without ZWJ) instead of ්‍ය (with ZWJ)
- Punctuation: may drop or alter commas and periods
- Rare English words: very uncommon English words in code-mixed input may be transliterated phonetically instead of translated
- Long sequences: quality may degrade for inputs longer than ~200 tokens
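The ZWJ issue above can be patched in post-processing: Sinhala conjuncts like the yansaya are encoded as al-lakuna (U+0DCA) + zero-width joiner (U+200D) + yayanna (U+0DBA), and the model sometimes omits the joiner. This is a hedged normalization sketch, not part of the released model:

```python
ZWJ = "\u200d"        # zero-width joiner
AL_LAKUNA = "\u0dca"  # Sinhala al-lakuna (virama)
YAYANNA = "\u0dba"    # Sinhala letter yayanna

def normalize_yansaya(text: str, use_zwj: bool = True) -> str:
    # normalize the yansaya conjunct to one consistent encoding
    plain = AL_LAKUNA + YAYANNA          # ්ය without joiner
    joined = AL_LAKUNA + ZWJ + YAYANNA   # ්‍ය with joiner
    without = text.replace(joined, plain)  # strip existing joiners first
    return without.replace(plain, joined) if use_zwj else without
```

The same pattern extends to other ZWJ conjuncts (e.g. the rakaransaya with U+0DBB) if they turn out to be affected.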
## Model Details

| Property | Value |
|---|---|
| Parameters | 273M |
| Precision | bfloat16 |
| Model size | ~536 MB |
| Context length | 256 tokens |
| Languages | Sinhala (si), English (en) |
| License | Apache 2.0 |
| Base model | google/gemma-3-270m-it |
## Citation

If you use this model, please cite:

```bibtex
@misc{gunarathna2025codemix,
  title={Gemma3-Singlish-Sinhala-CodeMix: Multi-Phase LoRA Fine-Tuning for Singlish-to-Sinhala Transliteration and Code-Mix Translation},
  author={Gunarathna, Savinu},
  year={2025},
  url={https://huggingface.co/savinugunarathna/Gemma3-Singlish-Sinhala-CodeMix}
}
```
## Related Work

```bibtex
@article{sumanathilaka2025swa,
  title={Swa-bhasha Resource Hub: Romanized Sinhala to Sinhala Transliteration Systems and Data Resources},
  author={Sumanathilaka, Deshan and Perera, Sameera and Dharmasiri, Sachithya and Athukorala, Maneesha and Herath, Anuja Dilrukshi and Dias, Rukshan and Gamage, Pasindu and Weerasinghe, Ruvan and Priyadarshana, YHPP},
  journal={arXiv preprint arXiv:2507.09245},
  year={2025}
}

@article{ranasinghe2022sold,
  title={SOLD: Sinhala Offensive Language Dataset},
  author={Ranasinghe, Tharindu and Anuradha, Isuri and Premasiri, Damith and Silva, Kanishka and Hettiarachchi, Hansi and Uyangodage, Lasitha and Zampieri, Marcos},
  journal={arXiv preprint arXiv:2212.00851},
  year={2022}
}

@inproceedings{Nsina2024,
  author={Hettiarachchi, Hansi and Premasiri, Damith and Uyangodage, Lasitha and Ranasinghe, Tharindu},
  title={{NSINA: A News Corpus for Sinhala}},
  booktitle={The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  year={2024},
  month={May}
}
```
## Acknowledgments

- Google for the Gemma model family
- Swa-bhasha Resource Hub for the phonetic transliteration dataset
- Hugging Face for the transformers ecosystem
- Training compute provided by Kaggle (P100 GPU)