LexCAT: Lexicon-Enhanced Code-Switched Attention Transformer for Tagalog–English Sentiment Analysis

Author: Glenn Marcus D. Cinco
Institution: Mapúa University
Thesis Title: LexCAT: A Lexicon-Based Approach for Code-Switching Analysis with Transformers Using XLM-RoBERTa and LexiLiksik
Degree: BS & MS in Computer Science
Date: July 2025


🎓 Abstract

LexCAT is a fine-tuned XLM-RoBERTa model enhanced with LexiLiksik, a cross-lingual, context-aware Tagalog–English sentiment lexicon developed to handle intra-sentential sentiment shifts in code-switched Tagalog–English (Taglish) text. LexCAT achieves 84.31% validation accuracy on the FiReCS dataset, outperforming the monolingual baselines FilCon (73.12%) and SentiWordNet (69.52%) by 11.19 and 14.79 percentage points, respectively.

This model is the final output of a thesis that systematically addresses gaps in code-switched sentiment analysis through lexicon development, attention biasing, and real-world fine-tuning.


🔧 Model Architecture

  • Base Model: xlm-roberta-base
  • Enhancement: Integrated LexiLiksik via:
    • Soft constraints during fine-tuning
    • Attention weight adjustment for sentiment-relevant tokens
    • Metadata-aware pooling (POS, code-switching type)
  • Fine-tuning Dataset: FiReCS (10,487 Filipino–English code-switched reviews)
  • Evaluation Dataset: SentiTaglish Products and Services + FiReCS Test Set
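
The list above does not spell out how attention weights are adjusted for sentiment-relevant tokens. One common way to bias attention toward lexicon entries is to add a bonus to the pre-softmax scores, and the sketch below illustrates that idea in plain Python. It is an assumption for illustration, not the thesis implementation: the toy lexicon entries, the `bias` value, and the function names are all hypothetical and are not taken from LexiLiksik.

```python
import math

# Hypothetical toy lexicon: token -> polarity score in [-1, 1].
# These entries are illustrative only, not actual LexiLiksik content.
TOY_LEXICON = {"maganda": 0.8, "expensive": -0.6, "lambot": 0.5}


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(tokens, scores, lexicon=TOY_LEXICON, bias=1.0):
    """Add bias * |polarity| to the raw score of each lexicon token,
    then normalise. Tokens outside the lexicon keep their raw score."""
    adjusted = [
        s + bias * abs(lexicon.get(tok.lower(), 0.0))
        for tok, s in zip(tokens, scores)
    ]
    return softmax(adjusted)


tokens = ["Maganda", "pero", "expensive"]
raw_scores = [0.0, 0.0, 0.0]  # uniform attention before biasing
weights = biased_attention(tokens, raw_scores)
for tok, w in zip(tokens, weights):
    print(f"{tok}: {w:.3f}")
```

With uniform raw scores, the two lexicon tokens end up with more weight than the connective "pero", and setting `bias=0.0` recovers ordinary softmax attention.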

📊 Performance

Metric       Value
Accuracy     84.31%
F1-Score     0.8566
Precision    0.8353
Recall       0.8574
Cohen’s κ    0.83 (inter-annotator agreement among the FiReCS annotators, not a model score)
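
For readers unfamiliar with the agreement statistic in the last row: Cohen's κ compares the observed agreement p_o between two annotators with the agreement p_e expected by chance, as κ = (p_o − p_e) / (1 − p_e). A minimal sketch follows; the label lists are made up for illustration and are not FiReCS annotations, and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label
    return (p_o - p_e) / (1 - p_e)


# Made-up annotations for six reviews (not real dataset labels).
annotator_a = ["pos", "pos", "neg", "neg", "neu", "pos"]
annotator_b = ["pos", "neg", "neg", "neg", "neu", "pos"]
kappa = cohens_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.4f}")
```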

Key Strength: Correctly classifies contrastive phrases such as “Maganda pero expensive” (“Nice but expensive”) as negative, even though the phrase opens with a positive word.


📥 How to Use

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Replace the placeholder with the actual Hugging Face repository ID.
model_name = "your-hf-username/LexCAT-LexiLiksik-Final"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # make sure dropout is disabled for inference

# Taglish input, roughly: "the burger is very tender but really expensive"
text = "sobrang lambot ng burger pero expensive tlga"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Gradients are not needed at inference time.
with torch.no_grad():
    outputs = model(**inputs)

probabilities = torch.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(probabilities, dim=-1).item()

# Class indices: 0 = Negative, 1 = Neutral, 2 = Positive
sentiment = ["Negative", "Neutral", "Positive"][predicted_class]
print(f"Predicted Sentiment: {sentiment}")
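
To make the last steps of the snippet above concrete, the following sketch reproduces the softmax and argmax in plain Python and also returns the per-class probabilities, which can be useful for inspecting borderline predictions. The label order follows the class comment in the snippet; the function name and the example logits are hypothetical and are not real model output.

```python
import math

# Label order as stated in the usage snippet:
# 0 = Negative, 1 = Neutral, 2 = Positive.
LABELS = ["Negative", "Neutral", "Positive"]


def logits_to_sentiment(logits, labels=LABELS):
    """Turn raw classifier logits into (label, per-class probabilities)."""
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], dict(zip(labels, probs))


example_logits = [1.2, 0.3, -0.5]  # made-up values for illustration
label, probs = logits_to_sentiment(example_logits)
print(label, {name: round(p, 3) for name, p in probs.items()})
```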

📚 Citation

@mastersthesis{cinco2025lexcat,
  author = {Cinco, Glenn Marcus D.},
  title = {LexCAT: A Lexicon-Based Approach for Code-Switching Analysis with Transformers Using XLM-RoBERTa and LexiLiksik},
  school = {Mapúa University},
  year = {2025},
  month = {July}
}