SerendipLLM V2 🇱🇰

The largest Sinhala instruction-following language model trained on 309,328 examples

SerendipLLM V2 is a specialized Sinhala language model with exceptional capabilities in news classification, question answering, and general Sinhala text generation. Built on Llama-3-8B with continued pre-training and instruction fine-tuning, it represents a significant advancement in Sinhala NLP.

🏆 Key Achievements

  • 6.2x larger dataset than existing Sinhala models (309K vs ~50K examples)
  • 45,080 news classification examples for specialized Sinhala news categorization
  • 50% training loss reduction (0.54 → 0.27) over 3 epochs
  • Comprehensive training on diverse Sinhala tasks
  • Open-source - Complete pipeline and dataset available

📊 Model Details

| Attribute | Value |
|---|---|
| Base Model | Meta Llama-3-8B |
| CPT Foundation | serendib-llm-cpt-llama3-8b |
| Parameters | 8.16B total, 130M trainable (1.59%) |
| Training Examples | 309,328 |
| Training Method | LoRA fine-tuning |
| Training Duration | 26.5 hours on A100 80GB |
| Final Loss | 0.27 |
| License | Apache 2.0 |
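As a quick sanity check, the trainable-parameter percentage in the table follows directly from the two parameter counts (a minimal sketch using only the figures above):

```python
# Recompute the trainable-parameter percentage from the table above.
total_params = 8.16e9      # 8.16B total parameters
trainable_params = 130e6   # 130M LoRA parameters

pct = 100 * trainable_params / total_params
print(f"{pct:.2f}% trainable")  # → 1.59% trainable
```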

🎯 Specialized Capabilities

News Classification (Our Strength!)

Trained on 45,080 Sinhala news examples - the largest news classification dataset for Sinhala.

Example:

Input: "ශ්‍රී ලංකා ක්‍රිකට් කණ්ඩායම අද ඉන්දියාවට එරෙහිව තරගයක් ආරම්භ කළේය" ("The Sri Lanka cricket team began a match against India today")
Output: "මෙය ක්‍රීඩා පුවතකි" ("This is a sports news item")

Question Answering

29,390 QA pairs covering geography, history, culture, and general knowledge.

Example:

Input: "ශ්‍රී ලංකාවේ අගනුවර කුමක්ද?" ("What is the capital of Sri Lanka?")
Output: "ශ්‍රී ලංකාවේ අගනුවර කොළඹයි" ("The capital of Sri Lanka is Colombo")

📈 Dataset Composition

| Category | Examples | Percentage |
|---|---|---|
| General Sinhala | 205,403 | 66.4% |
| News Classification | 45,080 | 14.6% |
| QA Pairs | 29,390 | 9.5% |
| Summarization | 19,593 | 6.3% |
| Rewrite/Formatting | 9,862 | 3.2% |
| **TOTAL** | **309,328** | **100%** |
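The category counts above can be cross-checked against the stated total and percentages (a small sketch using only the numbers from the table):

```python
# Verify that the dataset categories sum to the stated total
# and reproduce the listed percentages.
categories = {
    "General Sinhala": 205_403,
    "News Classification": 45_080,
    "QA Pairs": 29_390,
    "Summarization": 19_593,
    "Rewrite/Formatting": 9_862,
}

total = sum(categories.values())
assert total == 309_328  # matches the TOTAL row

for name, count in categories.items():
    print(f"{name}: {100 * count / total:.1f}%")
# → 66.4%, 14.6%, 9.5%, 6.3%, 3.2% — matching the table
```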

🚀 Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "Chamaka8/Serendip-LLM-CPT-SFT-v2",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Chamaka8/Serendip-LLM-CPT-SFT-v2")

# Format prompt (instruction: "Classify the following news article")
prompt = """### Instruction:
පහත පුවත් ලිපිය වර්ගීකරණය කරන්න

### Input:
ශ්‍රී ලංකා ක්‍රිකට් කණ්ඩායම අද ඉන්දියාවට එරෙහිව තරගයක් ආරම්භ කළේය.

### Response:
"""

# Generate (do_sample=True so that temperature and top_p take effect)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=150,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response.split("### Response:")[-1].strip())
```
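For repeated use, the Alpaca-style template above can be wrapped in a small helper. This is a convenience sketch, not part of the released code; `build_prompt` is a hypothetical name:

```python
def build_prompt(instruction: str, input_text: str = "") -> str:
    """Assemble the Alpaca-style prompt format used in the usage example."""
    prompt = f"### Instruction:\n{instruction}\n\n"
    if input_text:
        prompt += f"### Input:\n{input_text}\n\n"
    return prompt + "### Response:\n"

# Example: build the same news-classification prompt as above
p = build_prompt(
    "පහත පුවත් ලිපිය වර්ගීකරණය කරන්න",  # "Classify the following news article"
    "ශ්‍රී ලංකා ක්‍රිකට් කණ්ඩායම අද ඉන්දියාවට එරෙහිව තරගයක් ආරම්භ කළේය.",
)
print(p)
```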

⚙️ Training Configuration

Hardware

  • GPU: NVIDIA A100 SXM 80GB
  • Training Time: 26.5 hours
  • Cost: ~$37 USD

Hyperparameters

```python
num_train_epochs = 3
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
learning_rate = 2e-5
max_seq_length = 384
lora_r = 64
lora_alpha = 128
```
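From these hyperparameters, the effective batch size and approximate optimizer-step counts follow directly (a sketch assuming one example per sequence, no packing, and the full 309,328-example dataset):

```python
import math

num_examples = 309_328
per_device_batch = 8
grad_accum = 4
epochs = 3

# Effective batch size = per-device batch × gradient accumulation steps
effective_batch = per_device_batch * grad_accum
steps_per_epoch = math.ceil(num_examples / effective_batch)

print(f"effective batch: {effective_batch}")              # → 32
print(f"steps per epoch: {steps_per_epoch}")              # → 9667
print(f"total optimizer steps: {steps_per_epoch * epochs}")  # → 29001
```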

Training Loss

| Epoch | Loss |
|---|---|
| 1.0 | 0.28 |
| 2.0 | 0.24 |
| 3.0 | 0.27 |

📊 Comparison

| Model | Training Data | News Examples |
|---|---|---|
| SinLlama | ~50,000 | Limited |
| SerendipLLM V2 | 309,328 | 45,080 |

📚 Citation

```bibtex
@misc{serendipllm2026,
  title={SerendipLLM V2: Large-Scale Instruction-Tuning for Sinhala},
  author={Chamaka Alwis},
  year={2026},
  url={https://huggingface.co/Chamaka8/Serendip-LLM-CPT-SFT-v2}
}
```

📄 License

Apache 2.0


Built with ❤️ for the Sinhala NLP community
