# SinLlama-Singlis-Sentiment-Analysis

This model is a fine-tuned version of SinLlama-8B, optimized for sentiment analysis of Romanized Sinhala-English code-mixed text (commonly known as Singlish).
## Model Details

### Model Description
This model was developed to address the linguistic challenges of social media text in Sri Lanka, where users frequently mix Sinhala and English using the Roman alphabet. It uses a decoder-only architecture to capture sequential dependencies and semantic nuances in noisy, informal text more effectively than traditional statistical models.
- Developed by: V.S. Abeynayake (University of Ruhuna)
- Model type: Decoder-only Large Language Model (LLM)
- Language(s) (NLP): Romanized Sinhala (Singlish) and English
- Finetuned from model: polyglots/SinLlama_v01
- Task: 3-way Sentiment Classification (Positive, Negative, Neutral)
### Model Sources
- Repository: Vihanga445/sinllama-singlis-sentiment-analysis
- Thesis: Enhancing Sentiment Analysis for Romanized Sinhala-English Code-Mixed Social Media Text
## Uses

### Direct Use
The model is intended for classifying the sentiment of social media comments, product reviews, and public feedback written in Singlish or code-mixed Sinhala-English.
### Out-of-Scope Use
The model is not designed for formal Sinhala literature or technical English documents. It may not perform reliably on languages other than Sinhala and English.
## Bias, Risks, and Limitations
- Dataset Bias: The training data is derived from YouTube comments, reflecting a highly informal and domain-specific communication style.
- Class Imbalance: Despite oversampling, the model may still show a slight bias toward the "Positive" class due to its dominance in real-world social media behavior.
- Standardization: Since Singlish has no standardized spelling, extreme phonetic variations not seen in training may affect accuracy.
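Because spelling is unstandardized, lightly normalizing input text before inference can reduce the impact of phonetic variation. The sketch below is illustrative only and is not part of the model's released pipeline; the function name and normalization rules are assumptions.

```python
import re

def normalize_singlish(text: str) -> str:
    """Lowercase the text and collapse runs of 3+ repeated characters
    down to 2, so variants like 'NAAAA' and 'naa' map to the same form."""
    text = text.lower().strip()
    return re.sub(r"(.)\1{2,}", r"\1\1", text)
```

Normalization of this kind only mitigates surface-level variation; genuinely unseen phonetic spellings may still be misclassified.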
## How to Get Started with the Model
Use the following code to load the base model, attach the fine-tuned adapter, and run clean inference. Specific termination tokens are passed to `generate` so the model stops after emitting the label instead of rambling or generating extra examples.
```python
import torch
from unsloth import FastLanguageModel
from transformers import AutoTokenizer
from peft import PeftModel

# The Hugging Face model ID for the fine-tuned adapter
hf_model_id = "Vihanga445/sinllama-singlis-sentiment-analysis"
base_model_name = "polyglots/SinLlama_v01"

# 1. Load the tokenizer and the 4-bit base model
tokenizer = AutoTokenizer.from_pretrained(hf_model_id)
model, _ = FastLanguageModel.from_pretrained(
    model_name=base_model_name,
    max_seq_length=2048,
    dtype=torch.bfloat16,
    load_in_4bit=True,
    resize_model_vocab=139336,
)

# 2. Attach the adapters and switch to inference mode
model = PeftModel.from_pretrained(model, hf_model_id)
FastLanguageModel.for_inference(model)

# 3. Define the prompt and stopping criteria
prompt = """### Instruction:
Analyze the sentiment of the comment enclosed in square brackets, determine if it is positive, neutral, or negative, and return the answer as the corresponding sentiment label "Pos" or "Neu" or "Neg".
### Input:
[awulak na]
### Response:
"""

inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

# Llama-3 termination tokens: stop at EOS or <|eot_id|>
terminators = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]

# 4. Generate only the label (greedy decoding)
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    eos_token_id=terminators,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=False,
)

# 5. Clean and display the output
decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
final_answer = decoded.split("### Response:\n")[-1].strip().split("\n")[0]
print(f"Sentiment: {final_answer}")
```
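For reuse across many comments, the prompt construction and output cleaning above can be factored into small helpers. This is a minimal sketch; the function names and the `"Neu"` fallback for unexpected generations are my own conventions, not part of the released model.

```python
# Valid labels produced during fine-tuning
LABELS = {"Pos", "Neu", "Neg"}

PROMPT_TEMPLATE = """### Instruction:
Analyze the sentiment of the comment enclosed in square brackets, determine if it is positive, neutral, or negative, and return the answer as the corresponding sentiment label "Pos" or "Neu" or "Neg".
### Input:
[{comment}]
### Response:
"""

def build_prompt(comment: str) -> str:
    """Wrap a Singlish comment in the fine-tuning prompt format."""
    return PROMPT_TEMPLATE.format(comment=comment.strip())

def parse_label(decoded: str) -> str:
    """Extract the sentiment label from a full decoded generation,
    falling back to 'Neu' if the model produced something unexpected."""
    answer = decoded.split("### Response:")[-1].strip().split("\n")[0].strip()
    return answer if answer in LABELS else "Neu"
```

With these helpers, classifying a comment reduces to `parse_label(...)` applied to the decoded output of `model.generate` on `build_prompt(comment)`.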