Commit 34d86b2 (verified), committed by AnnyNguyen
1 Parent(s): 01e748a

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +132 -0
README.md ADDED
@@ -0,0 +1,132 @@
+ ---
+ license: apache-2.0
+ base_model: xlm-roberta-large
+ tags:
+ - vietnamese
+ - spam-detection
+ - text-classification
+ - e-commerce
+ datasets:
+ - ViSpamReviews
+ metrics:
+ - accuracy
+ - macro-f1
+ - macro-precision
+ - macro-recall
+ model-index:
+ - name: xlm-r-spam-binary
+   results:
+   - task:
+       type: text-classification
+       name: Spam Review Detection
+     dataset:
+       name: ViSpamReviews
+       type: ViSpamReviews
+     metrics:
+     - type: accuracy
+       value: 0.9020
+     - type: macro-f1
+       value: 0.8763
+ ---
+ # xlm-r-spam-binary: Spam Review Detection for Vietnamese Text
+
+ This model fine-tunes [xlm-roberta-large](https://huggingface.co/xlm-roberta-large) on the **ViSpamReviews** dataset to detect spam in Vietnamese e-commerce reviews.
+
+ ## Model Details
+
+ * **Base Model**: `xlm-roberta-large`
+ * **Description**: XLM-RoBERTa Large - Multilingual model
+ * **Dataset**: ViSpamReviews (Vietnamese Spam Review Dataset)
+ * **Fine-tuning Framework**: HuggingFace Transformers
+ * **Task**: Spam Review Detection (binary)
+ * **Number of Classes**: 2
+
+ ### Hyperparameters
+
+ * Max sequence length: `256`
+ * Learning rate: `5e-5`
+ * Batch size: `32`
+ * Epochs: `100`
+ * Early stopping patience: `5`
+
+ ## Dataset
53
+
54
+ The model was trained on the **ViSpamReviews** dataset, which contains 19,860 Vietnamese e-commerce review samples. The dataset includes:
55
+
56
+ * **Train set**: 14,299 samples (72%)
57
+ * **Validation set**: 1,590 samples (8%)
58
+ * **Test set**: 3,971 samples (20%)
59
+
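As a quick sanity check, the split sizes above can be summed and converted to percentages. The snippet uses only the counts stated in the README; the variable names are illustrative.

```python
# Split sizes as stated in the README.
splits = {"train": 14_299, "validation": 1_590, "test": 3_971}

total = sum(splits.values())

# Share of each split, rounded to the nearest whole percent.
shares = {name: round(100 * count / total) for name, count in splits.items()}

for name, count in splits.items():
    print(f"{name}: {count} samples ({shares[name]}%)")
print(f"total: {total}")
```
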
+ ### Label Distribution
+
+ The binary task uses two labels:
+
+ * **Non-spam** (0): Genuine product reviews
+ * **Spam** (1): Fake or promotional reviews
+
+ ## Results
+
+ The model was evaluated on the test set with the following metrics:
+
+ * **Accuracy**: `0.9020`
+ * **Macro-F1**: `0.8763`
+
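For readers unfamiliar with the two metrics, the sketch below computes accuracy and macro-F1 in plain Python, assuming the standard definitions (macro-F1 as the unweighted mean of per-class F1 scores). The labels are toy data invented for illustration, not model outputs, and `f1_for_class` is a hypothetical helper; the README does not state which library produced the reported numbers.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score for one class, treating `cls` as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (0 = Non-spam, 1 = Spam), invented for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(f1_for_class(y_true, y_pred, c) for c in (0, 1)) / 2

print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.4f}")
```

Because macro-F1 weights both classes equally, it tends to fall below accuracy when the minority class is harder to predict, which is consistent with the gap between the two reported scores.
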
+ ## Usage
+
+ The example below classifies a single Vietnamese review and prints the predicted label with its confidence:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ # Load model and tokenizer
+ model_name = "visolex/xlm-r-spam-binary"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ # Example review text ("This product is very good, the shop delivers fast!")
+ text = "Sản phẩm này rất tốt, shop giao hàng nhanh!"
+
+ # Tokenize, truncating to the max sequence length used for fine-tuning
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
+
+ # Predict without tracking gradients
+ with torch.no_grad():
+     outputs = model(**inputs)
+     predicted_class = outputs.logits.argmax(dim=-1).item()
+     probabilities = torch.softmax(outputs.logits, dim=-1)
+
+ # Map the class index to its label
+ label_map = {0: "Non-spam", 1: "Spam"}
+ predicted_label = label_map[predicted_class]
+ confidence = probabilities[0][predicted_class].item()
+
+ print(f"Text: {text}")
+ print(f"Predicted: {predicted_label} (confidence: {confidence:.2%})")
+ ```
+
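The last step of the usage example, turning two logits into a label and a confidence, can be reproduced without `torch` to make the arithmetic explicit. This is a minimal sketch: `decode` is a hypothetical helper, and the logits are invented values rather than real model outputs.

```python
import math

# Label mapping as given in the README.
LABELS = {0: "Non-spam", 1: "Spam"}


def decode(logits):
    """Return (label, confidence) for one example's raw logits."""
    # Subtract the max logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # The predicted class is the index with the highest probability.
    index = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[index], probs[index]


# Invented logits for illustration only.
label, confidence = decode([2.0, -1.0])
print(f"Predicted: {label} (confidence: {confidence:.2%})")
```
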
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @misc{xlm_r_spam_binary,
+   title={xlm-r-spam-binary: Spam Review Detection for Vietnamese Text},
+   author={ViSoLex Team},
+   year={2025},
+   howpublished={\url{https://huggingface.co/visolex/xlm-r-spam-binary}}
+ }
+ ```
+
+ ## License
+
+ This model is released under the Apache-2.0 license.
+
+ ## Acknowledgments
+
+ * Base model: [xlm-roberta-large](https://huggingface.co/xlm-roberta-large)
+ * Dataset: ViSpamReviews (Vietnamese Spam Review Dataset)
+ * ViSoLex Toolkit for Vietnamese NLP