---
library_name: transformers
base_model: UBC-NLP/AraT5v2-base-1024
tags:
- arabic
- darija
- sentiment-analysis
- text-classification
- tashkeel
- arabict5
---
# AraT5v2-Darja-Sentiment
A fine-tuned version of `UBC-NLP/AraT5v2-base-1024` for sentiment analysis of text written in Algerian Arabic (Darja), with or without Tashkīl (diacritics).
## Dataset
The model was trained on a custom dataset containing:
- `tweet`: the original short text in Algerian Arabic
- `text_catt`: the same text with Tashkīl (diacritics) added
- `label`: one of `positive`, `neutral`, `negative`
The input format used during training: `sentiment: [Darja]: <DARJA_TEXT> [Tashkīl]: <TASHKĪL_TEXT>`
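For illustration, here is a minimal sketch of how such an input string can be assembled from one dataset row. This is not the original preprocessing code: the `build_input` helper is hypothetical, and the example row simply reuses the text from the usage snippet below.

```python
# Minimal sketch (not the original preprocessing code): build the training
# prompt from one dataset row with `tweet` and `text_catt` columns.
def build_input(row: dict) -> str:  # hypothetical helper, for illustration only
    return f"sentiment: [Darja]: {row['tweet']} [Tashkīl]: {row['text_catt']}"

row = {
    "tweet": "والله غير كي شفتو فرحت",
    "text_catt": "وَاللَّهِ غَيْرُ كَيْ شَفْتُهُ فَرِحْتُ",
    "label": "positive",
}
print(build_input(row))
# sentiment: [Darja]: والله غير كي شفتو فرحت [Tashkīl]: وَاللَّهِ غَيْرُ كَيْ شَفْتُهُ فَرِحْتُ
```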
The data comes from the SemEval Task 12 `arq` (Algerian Arabic) subset.
## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("Noanihio/arat5v2-darja-sentiment")
tokenizer = AutoTokenizer.from_pretrained("Noanihio/arat5v2-darja-sentiment")

input_text = "sentiment: [Darja]: والله غير كي شفتو فرحت [Tashkīl]: وَاللَّهِ غَيْرُ كَيْ شَفْتُهُ فَرِحْتُ"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)
label = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(label)  # ➜ positive
```
## Training details

- **Model:** UBC-NLP/AraT5v2-base-1024
- **Trained on:** Google Colab Pro (T4 GPU)
- **Epochs:** 3
- **Batch size:** 8
- **Learning rate:** 5e-5
- **Framework:** `transformers.Trainer`, full fine-tuning, no LoRA (see the sketch below)
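For concreteness, here is a minimal sketch of this setup with the hyperparameters listed above. It is not the original training notebook: the tiny in-line dataset, the `tokenize` function, and the `output_dir` are placeholders for illustration.

```python
# Sketch of full fine-tuning with transformers.Trainer
# (3 epochs, batch size 8, learning rate 5e-5); not the original script.
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments,
)

model_name = "UBC-NLP/AraT5v2-base-1024"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)  # no LoRA: all weights are updated

# Stand-in dataset with the prompt format described above (replace with the real data).
raw = Dataset.from_list([
    {
        "text": "sentiment: [Darja]: والله غير كي شفتو فرحت [Tashkīl]: وَاللَّهِ غَيْرُ كَيْ شَفْتُهُ فَرِحْتُ",
        "label": "positive",
    },
])

def tokenize(batch):
    # Encode the prompt as encoder input and the sentiment word as the target sequence.
    enc = tokenizer(batch["text"], truncation=True, max_length=1024)
    enc["labels"] = tokenizer(text_target=batch["label"], truncation=True)["input_ids"]
    return enc

train_ds = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

args = TrainingArguments(
    output_dir="arat5v2-darja-sentiment",  # placeholder output path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```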
## Intended Use

This model is designed for:

- Automatic sentiment classification in Arabic dialects
- Evaluating emotional tone in Darja tweets and messages
- NLP research on underrepresented languages (Algerian Arabic)
## Limitations

- The model may be biased toward informal/digital Darja.
- Generalization to other Arabic dialects is limited.
- Tashkīl input can improve results, but is optional.
## Acknowledgements

Fine-tuned by @Noanihio with the help of Faiza Belbachir.