# MARBERT Model for Arabic Sentiment Analysis (Positive/Negative)
This is a fine-tuned version of [UBC-NLP/MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2) for Arabic sentiment analysis.
The model classifies Arabic text (specifically tweets) into two categories: Positive (`LABEL_1`) or Negative (`LABEL_0`).
## Live Demo
You can test the model live on the Hugging Face Space: https://huggingface.co/spaces/iMeshal/arabic-sentiment-app
## Model Performance
The model was trained on 80% of the training data and validated on 20%. The final evaluation was performed on a separate, unseen test set.
### Final Test Set Results (Accuracy: 94.40%)
| Metric | Score |
|---|---|
| Accuracy | 94.40% |
| F1 (Macro) | 94.40% |
| Precision (Macro) | 94.40% |
| Recall (Macro) | 94.40% |
| Loss | 0.1667 |
The model achieved its best validation accuracy of 93.4% at epoch 2, and `load_best_model_at_end` was enabled, so the epoch-2 checkpoint was kept as the final model.
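For reference, the macro-averaged metrics in the table can be reproduced with scikit-learn along these lines (a minimal sketch; `y_true` and `y_pred` are placeholder arrays, not the actual test-set outputs):

```python
# Minimal sketch: computing accuracy and macro-averaged precision/recall/F1
# with scikit-learn. y_true/y_pred are placeholders for the test-set gold
# labels and the model's predictions (1 = Positive, 0 = Negative).
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 0, 1, 0, 1, 0]  # gold labels (placeholder)
y_pred = [1, 0, 1, 0, 1, 1]  # model predictions (placeholder)

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro"
)
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision (Macro): {precision:.4f}")
print(f"Recall (Macro): {recall:.4f}")
print(f"F1 (Macro): {f1:.4f}")
```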
## Intended Use (How to Use)
You can use this model directly with the `transformers` pipeline:
```python
from transformers import pipeline

# Load the pipeline
pipe = pipeline(
    "sentiment-analysis",
    model="iMeshal/arabic-sentiment-classifier-marbert"
)

# Test with new texts
texts = [
    "هذا المنتج رائع جداً أنصح به",   # "This product is great, I recommend it"
    "أسوأ خدمة عملاء على الإطلاق",    # "Worst customer service ever"
    "الجو اليوم جميل"                 # "The weather today is beautiful"
]

results = pipe(texts)
print(results)

# Output:
# [
#   {'label': 'LABEL_1', 'score': 0.99...},  # Positive
#   {'label': 'LABEL_0', 'score': 0.99...},  # Negative
#   {'label': 'LABEL_1', 'score': 0.98...}   # Positive
# ]
```
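If you need more control than the pipeline offers (e.g., batch scoring or custom post-processing), a lower-level sketch using `AutoTokenizer` and `AutoModelForSequenceClassification` looks roughly like this; the label mapping follows this card (`LABEL_0` = Negative, `LABEL_1` = Positive):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "iMeshal/arabic-sentiment-classifier-marbert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Tokenize a batch of texts the same way the model expects
inputs = tokenizer(
    ["هذا المنتج رائع جداً أنصح به"],  # "This product is great, I recommend it"
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)

# Forward pass without gradient tracking, then softmax over the two classes
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)

pred = probs.argmax(dim=-1).item()
print("Positive" if pred == 1 else "Negative",
      f"(score: {probs[0, pred].item():.4f})")
```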
## Training Data
The model was trained on the Arabic Sentiment Twitter Corpus dataset from Kaggle.
- Preprocessing: abnormally long or concatenated tweets, which appeared to be noise, were filtered out during cleaning.
- Training Set: ~24,163 samples.
- Validation Set: ~6,041 samples.
- Test Set: ~11,508 samples.
- Balance: all splits were balanced at approximately 50% Positive / 50% Negative.
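For illustration, the 80/20 stratified split described above could be produced along these lines (a hypothetical sketch; `texts` and `labels` stand in for the cleaned Kaggle corpus, not the real data):

```python
# Hypothetical sketch of the 80/20 stratified train/validation split.
from sklearn.model_selection import train_test_split

texts = ["تغريدة إيجابية", "تغريدة سلبية"] * 50  # placeholder tweets
labels = [1, 0] * 50                             # 1 = Positive, 0 = Negative

train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts,
    labels,
    test_size=0.20,     # 80% train / 20% validation
    stratify=labels,    # preserve the ~50/50 class balance in both splits
    random_state=42,    # assumed seed, for reproducibility
)
print(len(train_texts), len(val_texts))  # 80, 20
```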
## Training Procedure
The model was trained using the `transformers.Trainer` class with the following key hyperparameters:
- Framework: PyTorch
- Base Model: UBC-NLP/MARBERTv2
- Epochs: 3 (with early stopping)
- Early Stopping: patience of 2 (training stopped at epoch 3, but epoch 2 was the best)
- Batch Size: 16
- Learning Rate: 2e-5
- Tokenizer: `AutoTokenizer` with `padding="max_length"`, `truncation=True`, `max_length=512`
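Put together, the setup presumably looked roughly like the sketch below (not the exact training script; the dataset objects and the `compute_metrics` function are placeholders you would supply):

```python
# Rough sketch of the Trainer setup with the hyperparameters listed above.
# train_dataset, val_dataset and compute_metrics are placeholders.
from transformers import (
    AutoModelForSequenceClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "UBC-NLP/MARBERTv2", num_labels=2
)

args = TrainingArguments(
    output_dir="marbert-arabic-sentiment",  # assumed output path
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    evaluation_strategy="epoch",   # evaluate after every epoch
    save_strategy="epoch",         # required for load_best_model_at_end
    load_best_model_at_end=True,   # restore the best (epoch-2) checkpoint
    metric_for_best_model="accuracy",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,      # placeholder: tokenized training split
    eval_dataset=val_dataset,         # placeholder: tokenized validation split
    compute_metrics=compute_metrics,  # placeholder: returns {"accuracy": ...}
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```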
## Contact
- Name: Meshal AL-Qushaym
- Email: meshalqushim@outlook.com
- Kaggle: kaggle.com/meshalfalah