Gemma3-Singlish-Sinhala-CodeMix

A fine-tuned Gemma-3-270M-IT model for Singlish → Sinhala transliteration and code-mixed text translation.

The model handles both:

  • Pure Singlish: Roman-script Sinhala → Sinhala Unicode
  • Code-mixed text: English-Sinhala mixed text → pure Sinhala (English words translated to Sinhala)

Examples

| Input | Output | Task |
|---|---|---|
| ara politician wa mama dannawa | අර දේශපාලකයෝ මම දන්නවා | Code-mix |
| meka harima boring panthiyak | මේක හරිම කම්මැලි පන්තියක් | Code-mix |
| oyage phone eka ko | ඔයාගේ දුරකථනය කෝ | Code-mix |
| mama gedara yanawa | මම ගෙදර යනවා | Transliteration |
| oyata kohomada | ඔයාට කොහොමද | Transliteration |

Evaluation Results

Phonetic Test Set

| Metric | Score |
|---|---|
| CER | 0.0191 |
| WER | 0.0876 |
| Exact Match Accuracy | 41.01% |

Adhoc Test Set

| Metric | Score |
|---|---|
| CER | 0.0416 |
| WER | 0.1484 |
| Exact Match Accuracy | 22.17% |
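The metrics above follow the standard definitions: CER is character-level edit distance divided by reference length, WER is the same at the word level, and exact match is the fraction of predictions identical to the target. A minimal sketch of how they can be computed (the helper names are illustrative, not taken from the actual evaluation code):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two sequences.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    # Character error rate: edit distance over reference length.
    return levenshtein(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    # Word error rate: same idea over whitespace-split tokens.
    ref_words = ref.split()
    return levenshtein(ref_words, hyp.split()) / max(len(ref_words), 1)

def exact_match(refs, hyps):
    # Fraction of predictions that match the target exactly.
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)
```

Corpus-level CER/WER is then the length-weighted aggregate of these per-sentence scores.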

Sample Predictions

Phonetic Test

Input:  awankawama mata eya mathaka ethi akaraya eyayi namuth eya wikarayaki mama obata ashwaya kerehi wedi elmak nodakwana namuth eya mage wilasithawa nowe
Target: අවංකවම මට එය මතක ඇති ආකාරය එයයි නමුත් එය විකාරයකි මම ඔබට අශ්වයා කෙරෙහි වැඩි ඇල්මක් නොදක්වන නමුත් එය මගේ විලාසිතාව නොවේ
Pred:   අවංකවම මට එය මතක ඇති ආකාරය එයයි නමුත් එය විකාරයකි මම ඔබට අශ්වයා කෙරෙහි වැඩි ඇල්මක් නොදක්වන නමුත් එය මගේ විලාසිතාව නොවේ ✓

Input:  oba mage aneka yeyi adahas karanne kese ho oba ema wilasithawata andinne mandeyi mama asami
Target: ඔබ මගේ අනෙකා යැයි අදහස් කරන්නේ කෙසේ හෝ ඔබ එම විලාසිතාවට අඳින්නේ මන්දැයි මම අසමි
Pred:   ඔබ මගේ අනෙකා යැයි අදහස් කරන්නේ කෙසේ හෝ ඔබ එම විලාසිතාවට අඳින්නේ මන්දැයි මම අසමි ✓

Adhoc Test

Input:  kmk nehe modyi wge
Target: කමක් නැහැ මෝඩයි වගේ
Pred:   කමක් නැහැ මෝඩයි වගේ ✓

Input:  eya wda honda deyi asanna
Target: එය වඩා හොඳ දැයි අසන්න
Pred:   එය වඩා හොඳ දැයි අසන්න ✓

Code-Mix Translation

| Input | Prediction |
|---|---|
| ara politician wa mama dannawa | අර දේශපාලකයෝ මම දන්නවා |
| meka harima boring panthiyak | මේක හරිම කම්මැලි පන්තියක් |
| oyage phone eka ko | ඔයාගේ දුරකථනය කෝ |

Training

Architecture & Method

  • Base model: google/gemma-3-270m-it (273M parameters)
  • Method: Multi-phase LoRA fine-tuning
  • Saved precision: bfloat16

Phase 1–3: Transliteration Foundation

The base transliteration model (Gemma3-Singlish-Sinhala-Merged) was trained in 3 progressive phases on the Swa-bhasha phonetic dataset:

| Phase | Data | Epochs | Learning Rate | LoRA Config |
|---|---|---|---|---|
| Phase 1: Foundation | 650K phonetic samples + augmentation | 2 | 1e-4 | r=64, α=128 |
| Phase 2: Expansion | 350K new + adhoc + replay + augmentation | 2 | 5e-5 | r=64, α=128 |
| Phase 3: Mastery | Adhoc-heavy + phonetic mix + augmentation | 2 | 2e-5 | r=64, α=128 |

Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj

After each phase, the LoRA adapter was merged into the base model.
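Merging folds the low-rank update into the dense weight as W ← W + (α/r)·BA, which is what peft's merge_and_unload() performs per target module. A numpy sketch of the arithmetic (shapes and values here are illustrative, not the model's actual dimensions):

```python
import numpy as np

# Illustrative shapes; Phases 1-3 used r=64, alpha=128, i.e. a scale of alpha/r = 2.
d_out, d_in, r, alpha = 6, 5, 2, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))   # frozen base weight
A = rng.standard_normal((r, d_in))       # LoRA A ("down" projection)
B = rng.standard_normal((d_out, r))      # LoRA B ("up" projection)
x = rng.standard_normal(d_in)

scale = alpha / r
W_merged = W + scale * (B @ A)           # fold the adapter into the dense weight

# The merged weight reproduces the adapter's forward pass exactly:
assert np.allclose(W_merged @ x, W @ x + scale * (B @ (A @ x)))
```

Because the merge is exact, each subsequent phase can attach a fresh adapter to the merged checkpoint without carrying the previous adapter around.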

Phase 4: Code-Mix Fine-tuning

A new LoRA adapter was trained on top of the merged Phase 3 model to add code-mix translation capability:

| Parameter | Value |
|---|---|
| Training data | ~57K samples (77% code-mix, 23% replay) |
| Code-mix data | ~27K unique samples × 2 upsample |
| Replay data | 10K phonetic + 3K adhoc (anti-forgetting) |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| LoRA dropout | 0.05 |
| Target modules | q, k, v, o, gate, up, down projections |
| Learning rate | 3e-5 (cosine decay) |
| Effective batch size | 32 |
| Epochs | 3 (with early stopping, patience=5) |
| Training precision | float32 (P100 stability) |
| Saved precision | bfloat16 |
| Hardware | NVIDIA Tesla P100 16GB |

Anti-Forgetting Strategy

  • Replay buffer: 13K transliteration samples mixed into code-mix training
  • Small LoRA rank (r=32): limits capacity to prevent overwriting base knowledge
  • Low learning rate (3e-5): gentle updates to existing weights
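The replay mix is plain dataset composition: code-mix pairs plus a fixed slice of the old transliteration data, shuffled together so both tasks appear in every batch. A sketch with tiny placeholder corpora that keep the 77% / 23% split from the Phase 4 table (real training used (singlish, sinhala) text pairs at ~57K scale):

```python
import random

# Tiny stand-in corpora; tags mark which task each sample came from.
codemix = [("codemix", i) for i in range(770)]   # upsampled code-mix pairs
replay  = [("replay", i) for i in range(230)]    # transliteration replay buffer

train = codemix + replay
random.Random(0).shuffle(train)                  # interleave tasks within each epoch

frac = sum(1 for tag, _ in train if tag == "replay") / len(train)
print(f"replay fraction: {frac:.0%}")            # replay fraction: 23%
```

Shuffling (rather than concatenating task blocks) matters: it keeps every optimizer step exposed to both tasks, which is the mechanism behind the anti-forgetting effect.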

Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
import warnings
warnings.filterwarnings("ignore")

MODEL_ID = "savinugunarathna/Gemma3-Singlish-Sinhala-CodeMix"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda" if torch.cuda.is_available() else "cpu")
model.eval()

PROMPT = "Transliterate the following Singlish text to Sinhala by considering the context:\n{input}\nSinhala:"

gen_config = GenerationConfig(
    max_new_tokens=256, num_beams=3, do_sample=False,
    repetition_penalty=1.2, pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id, top_p=None, top_k=None,
)

def convert(text):
    prompt = PROMPT.format(input=text)
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=True).to(model.device)
    in_len = inputs["input_ids"].shape[1]
    with torch.no_grad():
        out = model.generate(**inputs, generation_config=gen_config)
    return tokenizer.decode(out[0, in_len:], skip_special_tokens=True).strip()

# Transliteration
print(convert("mama gedara yanawa"))              # මම ගෙදර යනවා

# Code-mix translation
print(convert("ara politician wa mama dannawa"))  # අර දේශපාලකයෝ මම දන්නවා
print(convert("meka harima boring panthiyak"))    # මේක හරිම කම්මැලි පන්තියක්
print(convert("oyage phone eka ko"))              # ඔයාගේ දුරකථනය කෝ

Prompt Format

Single unified prompt for both tasks:

"Transliterate the following Singlish text to Sinhala by considering the context:\n{input}\nSinhala:"

Limitations

  • Conjunct consonants: Occasionally produces ්ය instead of ්‍ය (ZWJ variants)
  • Punctuation: May drop or alter commas and periods
  • Rare English words: Very uncommon English words in code-mix may be transliterated phonetically instead of translated
  • Long sequences: Quality may degrade for inputs longer than ~200 tokens
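The conjunct-consonant issue above is mechanically detectable: the yansaya conjunct is al-lakuna (U+0DCA) + zero-width joiner (U+200D) + yayanna (U+0DBA), so a bare U+0DCA U+0DBA pair signals a dropped ZWJ. A small sketch (whether a given word actually requires the joined form is word-specific, so this flags candidates rather than auto-fixing them):

```python
# Al-lakuna followed directly by yayanna, with no ZWJ in between.
BARE = "\u0dca\u0dba"   # ්ය  (ZWJ dropped)

def missing_zwj_positions(text):
    """Indices where ් is followed directly by ය with no joiner."""
    return [i for i in range(len(text) - 1) if text[i:i + 2] == BARE]

bad  = "\u0dc0\u0dd2\u0daf\u0dca\u0dba\u0dcf"        # විද්යා without the ZWJ
good = "\u0dc0\u0dd2\u0daf\u0dca\u200d\u0dba\u0dcf"  # විද්‍යා with the ZWJ

print(missing_zwj_positions(bad))   # [3]
print(missing_zwj_positions(good))  # []
```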

Model Details

| Property | Value |
|---|---|
| Parameters | 273M |
| Precision | bfloat16 |
| Model size | ~536MB |
| Context length | 256 tokens |
| Languages | Sinhala (si), English (en) |
| License | Apache 2.0 |
| Base model | google/gemma-3-270m-it |

Citation

If you use this model, please cite:

@misc{gunarathna2025codemix,
  title={Gemma3-Singlish-Sinhala-CodeMix: Multi-Phase LoRA Fine-Tuning for Singlish-to-Sinhala Transliteration and Code-Mix Translation},
  author={Gunarathna, Savinu},
  year={2025},
  url={https://huggingface.co/savinugunarathna/Gemma3-Singlish-Sinhala-CodeMix}
}

Related Work

@article{sumanathilaka2025swa,
  title={Swa-bhasha Resource Hub: Romanized Sinhala to Sinhala Transliteration Systems and Data Resources},
  author={Sumanathilaka, Deshan and Perera, Sameera and Dharmasiri, Sachithya and Athukorala, Maneesha and Herath, Anuja Dilrukshi and Dias, Rukshan and Gamage, Pasindu and Weerasinghe, Ruvan and Priyadarshana, YHPP},
  journal={arXiv preprint arXiv:2507.09245},
  year={2025}
}

@article{ranasinghe2022sold,
  title={SOLD: Sinhala Offensive Language Dataset},
  author={Ranasinghe, Tharindu and Anuradha, Isuri and Premasiri, Damith and Silva, Kanishka and Hettiarachchi, Hansi and Uyangodage, Lasitha and Zampieri, Marcos},
  journal={arXiv preprint arXiv:2212.00851},
  year={2022}
}

@inproceedings{Nsina2024,
  author={Hettiarachchi, Hansi and Premasiri, Damith and Uyangodage, Lasitha and Ranasinghe, Tharindu},
  title={{NSINA: A News Corpus for Sinhala}},
  booktitle={The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  year={2024},
  month={May},
}

Acknowledgments

  • Google for the Gemma model family
  • Swa-bhasha Resource Hub for the phonetic transliteration dataset
  • Hugging Face for the transformers ecosystem
  • Training compute provided by Kaggle (P100 GPU)