---
library_name: transformers
base_model: UBC-NLP/AraT5v2-base-1024
tags:
- arabic
- darija
- sentiment-analysis
- text-classification
- tashkeel
- arabict5
---
# AraT5v2-Darja-Sentiment
A fine-tuned version of `UBC-NLP/AraT5v2-base-1024` for sentiment analysis of text written in Algerian Arabic (Darja), with or without Tashkīl (diacritics).
## Dataset
The model was trained on a custom dataset containing:
- `tweet`: the original short text in Algerian Arabic
- `text_catt`: the same text with Tashkīl (diacritics) added
- `label`: one of `positive`, `neutral`, `negative`
The input format used during training: `sentiment: [Darja]: <DARJA_TEXT> [Tashkīl]: <TASHKĪL_TEXT>`
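For illustration, here is a minimal sketch of how such an input string can be assembled from one dataset row. This is not the original preprocessing code: the `build_input` helper is hypothetical, and the example row simply reuses the text from the usage snippet below.

```python
# Minimal sketch (not the original preprocessing code): build the training
# prompt from one dataset row with `tweet` and `text_catt` columns.
def build_input(row: dict) -> str:  # hypothetical helper, for illustration only
    return f"sentiment: [Darja]: {row['tweet']} [Tashkīl]: {row['text_catt']}"

row = {
    "tweet": "والله غير كي شفتو فرحت",
    "text_catt": "وَاللَّهِ غَيْرُ كَيْ شَفْتُهُ فَرِحْتُ",
    "label": "positive",
}
print(build_input(row))
# sentiment: [Darja]: والله غير كي شفتو فرحت [Tashkīl]: وَاللَّهِ غَيْرُ كَيْ شَفْتُهُ فَرِحْتُ
```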
The data comes from the SemEval Task 12 `arq` (Algerian Arabic) subset.
## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("Noanihio/arat5v2-darja-sentiment")
tokenizer = AutoTokenizer.from_pretrained("Noanihio/arat5v2-darja-sentiment")

input_text = "sentiment: [Darja]: والله غير كي شفتو فرحت [Tashkīl]: وَاللَّهِ غَيْرُ كَيْ شَفْتُهُ فَرِحْتُ"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)
label = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(label)  # ➜ positive
```
## Training details

- **Model:** UBC-NLP/AraT5v2-base-1024
- **Trained on:** Google Colab Pro (T4 GPU)
- **Epochs:** 3
- **Batch size:** 8
- **Learning rate:** 5e-5
- **Framework:** `transformers.Trainer`, full fine-tuning, no LoRA (see the sketch below)
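For concreteness, here is a minimal sketch of this setup with the hyperparameters listed above. It is not the original training notebook: the tiny in-line dataset, the `tokenize` function, and the `output_dir` are placeholders for illustration.

```python
# Sketch of full fine-tuning with transformers.Trainer
# (3 epochs, batch size 8, learning rate 5e-5); not the original script.
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments,
)

model_name = "UBC-NLP/AraT5v2-base-1024"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)  # no LoRA: all weights are updated

# Stand-in dataset with the prompt format described above (replace with the real data).
raw = Dataset.from_list([
    {
        "text": "sentiment: [Darja]: والله غير كي شفتو فرحت [Tashkīl]: وَاللَّهِ غَيْرُ كَيْ شَفْتُهُ فَرِحْتُ",
        "label": "positive",
    },
])

def tokenize(batch):
    # Encode the prompt as encoder input and the sentiment word as the target sequence.
    enc = tokenizer(batch["text"], truncation=True, max_length=1024)
    enc["labels"] = tokenizer(text_target=batch["label"], truncation=True)["input_ids"]
    return enc

train_ds = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

args = TrainingArguments(
    output_dir="arat5v2-darja-sentiment",  # placeholder output path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```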
## Intended Use

This model is designed for:

- Automatic sentiment classification in Arabic dialects
- Evaluating emotional tone in Darja tweets and messages
- NLP research on underrepresented languages (Algerian Arabic)
## Limitations

- The model may be biased toward informal/digital Darja.
- Generalization to other Arabic dialects is limited.
- Tashkīl input can improve results, but is optional.
## Acknowledgements

Fine-tuned by @Noanihio with the help of Faiza Belbachir.