# SerendipLLM V2 🇱🇰

**The largest Sinhala instruction-following language model, trained on 309,328 examples.**

SerendipLLM V2 is a specialized Sinhala language model with strong capabilities in news classification, question answering, and general Sinhala text generation. Built on Llama-3-8B with continued pre-training (CPT) and instruction fine-tuning, it represents a significant step forward for Sinhala NLP.
## 🏆 Key Achievements
- ✅ 6.2x larger dataset than existing Sinhala models (309K vs ~50K examples)
- ✅ 45,080 news classification examples for specialized Sinhala news categorization
- ✅ 50% training loss reduction (0.54 → 0.27) over 3 epochs
- ✅ Comprehensive training on diverse Sinhala tasks
- ✅ Open-source - Complete pipeline and dataset available
## 📊 Model Details
| Attribute | Value |
|---|---|
| Base Model | Meta Llama-3-8B |
| CPT Foundation | serendib-llm-cpt-llama3-8b |
| Parameters | 8.16B total, 130M trainable (1.59%) |
| Training Examples | 309,328 |
| Training Method | LoRA fine-tuning |
| Training Duration | 26.5 hours on A100 80GB |
| Final Loss | 0.27 |
| License | Apache 2.0 |
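As a quick sanity check, the trainable-parameter share in the table follows directly from the two parameter counts (values taken from the table above; the script is purely illustrative):

```python
# Parameter counts as reported in the model details table
total_params = 8.16e9      # 8.16B total parameters
trainable_params = 130e6   # 130M LoRA-trainable parameters

share = trainable_params / total_params
print(f"{share:.2%}")  # 1.59%
```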
## 🎯 Specialized Capabilities

### News Classification (Our Strength!)

Trained on 45,080 Sinhala news examples, the largest news classification dataset for Sinhala.
**Example:**

Input: "ශ්රී ලංකා ක්රිකට් කණ්ඩායම අද ඉන්දියාවට එරෙහිව තරගයක් ආරම්භ කළේය" ("The Sri Lanka cricket team began a match against India today")

Output: "මෙය ක්රීඩා පුවතකි" ("This is a sports news item") ✅
### Question Answering
29,390 QA pairs covering geography, history, culture, and general knowledge.
**Example:**

Input: "ශ්රී ලංකාවේ අගනුවර කුමක්ද?" ("What is the capital of Sri Lanka?")

Output: "ශ්රී ලංකාවේ අගනුවර කොළඹයි" ("The capital of Sri Lanka is Colombo") ✅
## 📈 Dataset Composition
| Category | Examples | Percentage |
|---|---|---|
| General Sinhala | 205,403 | 66.4% |
| News Classification | 45,080 | 14.6% |
| QA Pairs | 29,390 | 9.5% |
| Summarization | 19,593 | 6.3% |
| Rewrite/Formatting | 9,862 | 3.2% |
| **Total** | **309,328** | **100%** |
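The counts and percentages in the table can be cross-checked with a few lines of Python (category names and counts copied from the table; the script is illustrative only):

```python
# Dataset composition as reported in the table above
counts = {
    "General Sinhala": 205_403,
    "News Classification": 45_080,
    "QA Pairs": 29_390,
    "Summarization": 19_593,
    "Rewrite/Formatting": 9_862,
}

total = sum(counts.values())
print(total)  # 309328 training examples in total

for name, n in counts.items():
    print(f"{name}: {n / total:.1%}")
```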
## 🚀 Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "Chamaka8/Serendip-LLM-CPT-SFT-v2",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Chamaka8/Serendip-LLM-CPT-SFT-v2")

# Format the prompt in the Alpaca-style template used during fine-tuning
prompt = """### Instruction:
පහත පුවත් ලිපිය වර්ගීකරණය කරන්න
### Input:
ශ්රී ලංකා ක්රිකට් කණ්ඩායම අද ඉන්දියාවට එරෙහිව තරගයක් ආරම්භ කළේය.
### Response:
"""

# Generate (do_sample=True so temperature/top_p actually take effect)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=150,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response.split("### Response:")[-1].strip())
```
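For repeated calls, the prompt construction and response parsing above can be factored into small helpers. These function names are my own, not part of the model's API; the template mirrors the one used in the usage snippet:

```python
# Alpaca-style template matching the fine-tuning prompt format
PROMPT_TEMPLATE = """### Instruction:
{instruction}
### Input:
{input}
### Response:
"""

def format_prompt(instruction: str, input_text: str) -> str:
    """Build a prompt in the format the model was fine-tuned on."""
    return PROMPT_TEMPLATE.format(instruction=instruction, input=input_text)

def extract_response(decoded_text: str) -> str:
    """Keep only the model's answer from the full decoded output."""
    return decoded_text.split("### Response:")[-1].strip()

prompt = format_prompt(
    "පහත පුවත් ලිපිය වර්ගීකරණය කරන්න",
    "ශ්රී ලංකා ක්රිකට් කණ්ඩායම අද ඉන්දියාවට එරෙහිව තරගයක් ආරම්භ කළේය.",
)
print(prompt.endswith("### Response:\n"))  # True
```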
## ⚙️ Training Configuration

### Hardware

- GPU: NVIDIA A100 SXM 80GB
- Training Time: 26.5 hours
- Cost: ~$37 USD

### Hyperparameters
```python
num_train_epochs = 3
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
learning_rate = 2e-5
max_seq_length = 384
lora_r = 64
lora_alpha = 128
```
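These settings imply an effective batch size of 32; combined with the 309,328 training examples, the approximate step counts work out as follows (assuming a single GPU, as in the hardware section):

```python
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
num_train_epochs = 3
num_examples = 309_328

# Effective batch size = per-device batch * accumulation steps (single GPU)
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 32

# Optimizer steps per epoch, rounding up the final partial batch
steps_per_epoch = -(-num_examples // effective_batch_size)
print(steps_per_epoch)                     # 9667
print(steps_per_epoch * num_train_epochs)  # 29001 total steps
```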
### Training Loss
| Epoch | Loss |
|---|---|
| 1.0 | 0.28 |
| 2.0 | 0.24 |
| 3.0 | 0.27 |
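The "50% training loss reduction" figure from the summary can be reproduced from the start-of-training loss (0.54, quoted in the Key Achievements section) and the final value above:

```python
initial_loss = 0.54  # loss at the start of training
final_loss = 0.27    # loss after epoch 3

reduction = (initial_loss - final_loss) / initial_loss
print(f"{reduction:.0%}")  # 50%
```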
## 📊 Comparison
| Model | Training Data | News Examples |
|---|---|---|
| SinLlama | ~50,000 | Limited |
| SerendipLLM V2 | 309,328 | 45,080 ✅ |
## 🔗 Resources

- Dataset: Serendip-sft-sinhala
- Base CPT: serendib-llm-cpt-llama3-8b
- Training Script: see the `training_scripts/` folder
## 📚 Citation

```bibtex
@misc{serendipllm2026,
  title={SerendipLLM V2: Large-Scale Instruction-Tuning for Sinhala},
  author={Chamaka Alwis},
  year={2026},
  url={https://huggingface.co/Chamaka8/Serendip-LLM-CPT-SFT-v2}
}
```
## 📄 License

Apache 2.0

Built with ❤️ for the Sinhala NLP community.