# Spam Detection for Social Media Text

**Multilingual Indonesian & English | XLM-RoBERTa**

This model is a fine-tuned XLM-RoBERTa that classifies social media text as Spam or Ham. It supports Indonesian (including regional languages), Malay, and English, making it suitable for multi-platform moderation use cases such as Twitter/X, Instagram, TikTok, Facebook, and online forums.
## ✨ Key Features

- ✅ Spam vs Ham classification
- 🌏 Multilingual support (Indonesian & English, plus regional languages)
- 🧠 Based on XLM-RoBERTa (multilingual transformer)
- ⚡ Ready to use with the Hugging Face `pipeline` API
- 📊 Strong performance on noisy social media text
## 🌍 Supported Languages

- 🇮🇩 Indonesian
- Malay
- Javanese
- Regional Indonesian languages (Acehnese, Banjarese, Buginese, Minangkabau, Sundanese, etc.)
- 🇬🇧 English
## 🧪 Model Performance

| Metric | Score |
|---|---|
| Accuracy | 0.9451 |
| F1 (Macro) | 0.9446 |
| F1 (Weighted) | 0.9500 |
| Precision | 0.9500 |
| Recall | 0.9500 |
| Training Loss | 0.1187 |
| Validation Loss | 0.2370 |
Evaluated on held-out validation data with balanced spam/ham distribution.
## 🚀 Quick Start

### Installation

```bash
pip install transformers torch
```

### Single Prediction

```python
from transformers import pipeline

classifier = pipeline(
    task="text-classification",
    model="nahiar/spam-detection-xlm-roberta-v1",
)

result = classifier("PASTI DIJAMIN WDP 100%")
print(result)
```
Output:

```python
[{'label': 'LABEL_1', 'score': 0.9876}]
```
### Label Mapping

- `LABEL_0` → SPAM
- `LABEL_1` → HAM
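If you want readable class names rather than raw `LABEL_*` strings, they can be converted in post-processing. A minimal sketch following the label mapping above (the `to_readable` helper is illustrative, not part of the model):

```python
# Map the model's raw output labels to readable names,
# following the label mapping above.
LABEL_NAMES = {"LABEL_0": "SPAM", "LABEL_1": "HAM"}

def to_readable(result):
    """Replace a pipeline result's raw label with its readable name."""
    return {"label": LABEL_NAMES[result["label"]], "score": result["score"]}

print(to_readable({"label": "LABEL_1", "score": 0.9876}))
# {'label': 'HAM', 'score': 0.9876}
```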
## 📦 Batch Inference Example

```python
texts = [
    "साइबर हमले के बाद JLR का बड़ा बयान - जानें कंपनी ने क्या कहा | Tata Motors के शेयर पर दिखेगा असर? #TataMotors #JLR #CyberAttack https://t.co/6WlGS77UUp",
    "Kita sudah Ready skrg ini bagi yang memerlukan jasa pemulihan akun & Hapus All akun Lacak lokasi / sadap wa / Hack Akun / Revengeporn - korban pemerasan vcs / terror TIKTOK,GMAIL,TWITER,TELEGRAM, FACEBOOK,INSTAGRAM #revengeporn #zonauang ☎️ https://t.co/K0AbW08qnU https://t.co/4IpWNA7a0z",
    "💥Slot Gacor Hari ini Rute303 💥Jaminan Jackpot Maxwin malam ini LINK SLOT GACOR HARI INI : https://t.co/QvxjCAnt8o Tags: Jumbo #timsekop Jumat gratis ongkir Like Crazy PSIM https://t.co/ukuRdlvgGA",
]

results = classifier(texts)
for text, result in zip(texts, results):
    print(f"{text} -> {result['label']} ({result['score']:.4f})")
```
## 🏗️ Training Configuration

| Parameter | Value |
|---|---|
| Base Model | xlm-roberta-base |
| Training Samples | 11,958 |
| Validation Samples | 2,989 |
| Epochs | 3 |
| Learning Rate | 2e-5 |
| Batch Size | 16 |
| Training Date | 2025-12-15 |
## 🎯 Intended Use Cases
- Social media spam moderation
- Comment & post filtering
- Content quality control
- Pre-filtering for sentiment or topic analysis pipelines
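For the pre-filtering use case, messages classified as spam can be dropped before they reach a downstream sentiment or topic pipeline. A minimal sketch, assuming `LABEL_0` means SPAM (per the label mapping above) and that `results` come from the classifier; the `keep_ham` helper and the mocked results are illustrative:

```python
def keep_ham(texts, results, spam_label="LABEL_0"):
    """Return only the texts whose pipeline result is not spam."""
    return [t for t, r in zip(texts, results) if r["label"] != spam_label]

# Example with mocked pipeline output (in practice: results = classifier(texts)):
texts = ["PASTI DIJAMIN WDP 100%", "see you at lunch?"]
results = [{"label": "LABEL_0", "score": 0.99}, {"label": "LABEL_1", "score": 0.97}]
print(keep_ham(texts, results))  # ['see you at lunch?']
```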
## ⚠️ Limitations
- Binary classification only (Spam / Ham)
- Not optimized for non-social-media formal text
- Performance may degrade on very short or ambiguous messages
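One way to mitigate degraded performance on short or ambiguous messages is to act automatically only on high-confidence predictions and route the rest to human review. A minimal sketch (the `route` helper and the 0.90 threshold are illustrative choices, not tuned for this model):

```python
def route(result, threshold=0.90):
    """Return 'auto' when the classifier is confident enough, else 'review'."""
    return "auto" if result["score"] >= threshold else "review"

print(route({"label": "LABEL_0", "score": 0.99}))  # auto
print(route({"label": "LABEL_1", "score": 0.61}))  # review
```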
## 📜 License
Released under the Apache 2.0 License. Free for commercial and research use.
## 📚 Citation

If you use this model in your work, please cite:

```bibtex
@misc{djunaedi2025spam,
  author    = {Raihan Hidayatullah Djunaedi},
  title     = {Spam Detection for Social Media Text},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/nahiar/spam-detection-xlm-roberta-v1}
}
```
## 🙌 Acknowledgements
- Hugging Face Transformers
- Facebook AI Research — XLM-RoBERTa