# SerendipLLM V2 🇱🇰

**The largest Sinhala instruction-following language model, trained on 309,328 examples.**

SerendipLLM V2 is a specialized Sinhala language model with strong capabilities in news classification, question answering, and general Sinhala text generation. Built on Llama-3-8B with continued pre-training (CPT) and instruction fine-tuning, it represents a significant step forward for Sinhala NLP.
## 🏆 Key Achievements
- ✅ 6.2x larger dataset than existing Sinhala models (309K vs ~50K examples)
- ✅ 45,080 news classification examples for specialized Sinhala news categorization
- ✅ 50% training loss reduction (0.54 → 0.27) over 3 epochs
- ✅ Comprehensive training on diverse Sinhala tasks
- ✅ Open-source - Complete pipeline and dataset available
## 📊 Model Details
| Attribute | Value |
|---|---|
| Base Model | Meta Llama-3-8B |
| CPT Foundation | serendib-llm-cpt-llama3-8b |
| Parameters | 8.16B total, 130M trainable (1.59%) |
| Training Examples | 309,328 |
| Training Method | LoRA fine-tuning |
| Training Duration | 26.5 hours on A100 80GB |
| Final Loss | 0.27 |
| License | Apache 2.0 |
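As a quick sanity check, the trainable-parameter share in the table follows directly from the two parameter counts (values taken from the table above; the script is purely illustrative):

```python
# Parameter counts as reported in the model details table
total_params = 8.16e9      # 8.16B total parameters
trainable_params = 130e6   # 130M LoRA-trainable parameters

share = trainable_params / total_params
print(f"{share:.2%}")  # 1.59%
```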
## 🎯 Specialized Capabilities

### News Classification (Our Strength!)

Trained on 45,080 Sinhala news examples, the largest news classification dataset for Sinhala.
**Example:**

Input: "ශ්රී ලංකා ක්රිකට් කණ්ඩායම අද ඉන්දියාවට එරෙහිව තරගයක් ආරම්භ කළේය" ("The Sri Lanka cricket team began a match against India today")

Output: "මෙය ක්රීඩා පුවතකි" ("This is a sports news item") ✅
### Question Answering
29,390 QA pairs covering geography, history, culture, and general knowledge.
**Example:**

Input: "ශ්රී ලංකාවේ අගනුවර කුමක්ද?" ("What is the capital of Sri Lanka?")

Output: "ශ්රී ලංකාවේ අගනුවර කොළඹයි" ("The capital of Sri Lanka is Colombo") ✅
## 📈 Dataset Composition
| Category | Examples | Percentage |
|---|---|---|
| General Sinhala | 205,403 | 66.4% |
| News Classification | 45,080 | 14.6% |
| QA Pairs | 29,390 | 9.5% |
| Summarization | 19,593 | 6.3% |
| Rewrite/Formatting | 9,862 | 3.2% |
| **Total** | **309,328** | **100%** |
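The counts and percentages in the table can be cross-checked with a few lines of Python (category names and counts copied from the table; the script is illustrative only):

```python
# Dataset composition as reported in the table above
counts = {
    "General Sinhala": 205_403,
    "News Classification": 45_080,
    "QA Pairs": 29_390,
    "Summarization": 19_593,
    "Rewrite/Formatting": 9_862,
}

total = sum(counts.values())
print(total)  # 309328 training examples in total

for name, n in counts.items():
    print(f"{name}: {n / total:.1%}")
```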
## 🚀 Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "Chamaka8/Serendip-LLM-CPT-SFT-v2",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Chamaka8/Serendip-LLM-CPT-SFT-v2")

# Format the prompt in the Alpaca-style template used during fine-tuning
prompt = """### Instruction:
පහත පුවත් ලිපිය වර්ගීකරණය කරන්න
### Input:
ශ්රී ලංකා ක්රිකට් කණ්ඩායම අද ඉන්දියාවට එරෙහිව තරගයක් ආරම්භ කළේය.
### Response:
"""

# Generate (do_sample=True so temperature/top_p actually take effect)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=150,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response.split("### Response:")[-1].strip())
```
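For repeated calls, the prompt construction and response parsing above can be factored into small helpers. These function names are my own, not part of the model's API; the template mirrors the one used in the usage snippet:

```python
# Alpaca-style template matching the fine-tuning prompt format
PROMPT_TEMPLATE = """### Instruction:
{instruction}
### Input:
{input}
### Response:
"""

def format_prompt(instruction: str, input_text: str) -> str:
    """Build a prompt in the format the model was fine-tuned on."""
    return PROMPT_TEMPLATE.format(instruction=instruction, input=input_text)

def extract_response(decoded_text: str) -> str:
    """Keep only the model's answer from the full decoded output."""
    return decoded_text.split("### Response:")[-1].strip()

prompt = format_prompt(
    "පහත පුවත් ලිපිය වර්ගීකරණය කරන්න",
    "ශ්රී ලංකා ක්රිකට් කණ්ඩායම අද ඉන්දියාවට එරෙහිව තරගයක් ආරම්භ කළේය.",
)
print(prompt.endswith("### Response:\n"))  # True
```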
## ⚙️ Training Configuration

### Hardware

- GPU: NVIDIA A100 SXM 80GB
- Training Time: 26.5 hours
- Cost: ~$37 USD

### Hyperparameters
```python
num_train_epochs = 3
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
learning_rate = 2e-5
max_seq_length = 384
lora_r = 64
lora_alpha = 128
```
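These settings imply an effective batch size of 32; combined with the 309,328 training examples, the approximate step counts work out as follows (assuming a single GPU, as in the hardware section):

```python
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
num_train_epochs = 3
num_examples = 309_328

# Effective batch size = per-device batch * accumulation steps (single GPU)
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 32

# Optimizer steps per epoch, rounding up the final partial batch
steps_per_epoch = -(-num_examples // effective_batch_size)
print(steps_per_epoch)                     # 9667
print(steps_per_epoch * num_train_epochs)  # 29001 total steps
```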
### Training Loss
| Epoch | Loss |
|---|---|
| 1.0 | 0.28 |
| 2.0 | 0.24 |
| 3.0 | 0.27 |
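The "50% training loss reduction" figure from the summary can be reproduced from the start-of-training loss (0.54, quoted in the Key Achievements section) and the final value above:

```python
initial_loss = 0.54  # loss at the start of training
final_loss = 0.27    # loss after epoch 3

reduction = (initial_loss - final_loss) / initial_loss
print(f"{reduction:.0%}")  # 50%
```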
## 📊 Comparison
| Model | Training Data | News Examples |
|---|---|---|
| SinLlama | ~50,000 | Limited |
| SerendipLLM V2 | 309,328 | 45,080 ✅ |
## 🔗 Resources

- Dataset: Serendip-sft-sinhala
- Base CPT: serendib-llm-cpt-llama3-8b
- Training Script: see the `training_scripts/` folder
## 📚 Citation

```bibtex
@misc{serendipllm2026,
  title={SerendipLLM V2: Large-Scale Instruction-Tuning for Sinhala},
  author={Chamaka Alwis},
  year={2026},
  url={https://huggingface.co/Chamaka8/Serendip-LLM-CPT-SFT-v2}
}
```
## 📄 License

Apache 2.0

Built with ❤️ for the Sinhala NLP community.