+---
+license: apache-2.0
+base_model: xlm-roberta-large
+tags:
+- vietnamese
+- spam-detection
+- text-classification
+- e-commerce
+datasets:
+- ViSpamReviews
+metrics:
+- accuracy
+- macro-f1
+- macro-precision
+- macro-recall
+model-index:
+- name: xlm-r-spam-binary
+  results:
+  - task:
+      type: text-classification
+      name: Spam Review Detection
+    dataset:
+      name: ViSpamReviews
+      type: ViSpamReviews
+    metrics:
+      - type: accuracy
+        value: 0.9020
+      - type: macro-f1
+        value: 0.8763
+---
+# xlm-r-spam-binary: Spam Review Detection for Vietnamese Text
+This model is a fine-tuned version of [xlm-roberta-large](https://huggingface.co/xlm-roberta-large) on the **ViSpamReviews** dataset for spam review detection in Vietnamese e-commerce reviews.
+## Model Details
+* **Base Model**: `xlm-roberta-large`
+* **Description**: XLM-RoBERTa Large - Multilingual model
+* **Dataset**: ViSpamReviews (Vietnamese Spam Review Dataset)
+* **Fine-tuning Framework**: HuggingFace Transformers
+* **Task**: Spam Review Detection (binary)
+* **Number of Classes**: 2
+### Hyperparameters
+* Max sequence length: `256`
+* Learning rate: `5e-5`
+* Batch size: `32`
+* Epochs: `100` (upper bound; early stopping can end training sooner)
+* Early stopping patience: `5`
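The interaction between the 100-epoch cap and a patience of 5 can be sketched in plain Python. This is a minimal illustration, not the training code behind this model: the function name `stopping_epoch` and the validation scores are hypothetical, and it assumes a higher-is-better validation metric, since the card does not state which metric was monitored.

```python
def stopping_epoch(val_scores, max_epochs=100, patience=5):
    """Return the 1-based epoch at which training would stop.

    Assumption: the monitored validation metric is higher-is-better.
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        if score > best:
            # New best score: reset the patience counter.
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # Patience never ran out: stop at the cap or when scores are exhausted.
    return min(len(val_scores), max_epochs)


# Hypothetical validation scores: no strict improvement after epoch 3.
scores = [0.80, 0.85, 0.87, 0.86, 0.87, 0.86, 0.85, 0.86, 0.84]
print(stopping_epoch(scores))
```

With these made-up scores, the best value is reached at epoch 3 and the run ends five non-improving epochs later, well before the 100-epoch limit.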
+## Dataset
+The model was trained on the **ViSpamReviews** dataset, which contains 19,860 Vietnamese e-commerce review samples, split as follows:
+* **Train set**: 14,299 samples (72%)
+* **Validation set**: 1,590 samples (8%)
+* **Test set**: 3,971 samples (20%)
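As a quick consistency check, the split sizes listed above can be summed and converted to percentages. The snippet uses only the counts given in this card; the variable names are illustrative.

```python
# Split sizes as reported in this model card.
splits = {"train": 14_299, "validation": 1_590, "test": 3_971}

total = sum(splits.values())
print(f"total: {total}")

# Share of each split, rounded to one decimal place.
shares = {name: n / total for name, n in splits.items()}
for name, share in shares.items():
    print(f"{name}: {splits[name]} ({share:.1%})")
```

The three splits add up to the stated 19,860 samples, and the rounded shares match the 72% / 8% / 20% figures.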
+### Labels
+* **Non-spam** (0): Genuine product reviews
+* **Spam** (1): Fake or promotional reviews
+## Results
+The model was evaluated on the test set with the following metrics:
+* **Accuracy**: `0.9020`
+* **Macro-F1**: `0.8763`
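Macro-F1 is the unweighted mean of the per-class F1 scores, so it can sit below accuracy when the rarer class is harder to predict. The sketch below computes both metrics in plain Python on a small made-up example; the labels, the predictions, and the helper name `macro_f1` are hypothetical and are not drawn from ViSpamReviews.

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical imbalanced toy data: 8 non-spam (0), 2 spam (1).
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [1, 0]  # one spam review is missed

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.2f}")
print(f"macro-F1: {macro_f1(y_true, y_pred):.2f}")
```

In this toy case a single missed spam review leaves accuracy at 0.90 but pulls macro-F1 down to about 0.80, because the spam class contributes half of the macro average despite being only a fifth of the samples.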
+## Usage
+You can use this model for spam review detection in Vietnamese text. Below is an example:
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+# Load model and tokenizer
+model_name = "visolex/xlm-r-spam-binary"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+# Example review text
+text = "Sản phẩm này rất tốt, shop giao hàng nhanh!"
+# Tokenize
+inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
+# Predict
+with torch.no_grad():
+    outputs = model(**inputs)
+    predicted_class = outputs.logits.argmax(dim=-1).item()
+    probabilities = torch.softmax(outputs.logits, dim=-1)
+# Map to label
+label_map = {0: "Non-spam", 1: "Spam"}
+predicted_label = label_map[predicted_class]
+confidence = probabilities[0][predicted_class].item()
+print(f"Text: {text}")
+print(f"Predicted: {predicted_label} (confidence: {confidence:.2%})")
+```
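The prediction step in the example above comes down to a softmax over two logits followed by an argmax. The following dependency-free sketch shows that arithmetic on its own; the logit values are hypothetical, and the label mapping is the one used in the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for [Non-spam, Spam]; a real model would produce these.
logits = [-1.2, 2.3]
probs = softmax(logits)

# Argmax picks the class; its probability is reported as the confidence.
pred = max(range(len(probs)), key=probs.__getitem__)
label = {0: "Non-spam", 1: "Spam"}[pred]
print(f"Predicted: {label} (confidence: {probs[pred]:.2%})")
```

Because the softmax preserves the ordering of the logits, taking the argmax of the logits or of the probabilities gives the same class.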
+## Citation
+If you use this model, please cite:
+```bibtex
+@misc{visolex_xlm_r_spam_binary,
+  title={XLM-RoBERTa Large - Multilingual model},
+  author={ViSoLex Team},
+  year={2025},
+  howpublished={\url{https://huggingface.co/visolex/xlm-r-spam-binary}}
+}
+```
+## License
+This model is released under the Apache-2.0 license.
+## Acknowledgments
+* Base model: [xlm-roberta-large](https://huggingface.co/xlm-roberta-large)
+* Dataset: ViSpamReviews (Vietnamese Spam Review Dataset)
+* ViSoLex Toolkit for Vietnamese NLP