---
library_name: transformers
base_model: UBC-NLP/AraT5v2-base-1024
pipeline_tag: text-classification
tags:
- arabic
- darija
- sentiment-analysis
- text-classification
- tashkeel
- arabict5
---

# AraT5v2-Darja-Sentiment

Fine-tuned version of [`UBC-NLP/AraT5v2-base-1024`](https://huggingface.co/UBC-NLP/AraT5v2-base-1024) for **sentiment analysis** of texts written in **Algerian Arabic (Darja)**, with or without **Tashkīl** (diacritics).

---

## Dataset

The model was trained on a custom dataset containing:

- `tweet`: the original short text in Algerian Arabic
- `text_catt`: the same text with Tashkīl (diacritics) added
- `label`: one of `positive`, `neutral`, `negative`

> The input format used during training:
> `sentiment: [Darja]: <tweet> [Tashkīl]: <text_catt>`

The data comes from the SemEval Task 12 `arq` (Algerian Arabic) subset.

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("Noanihio/arat5v2-darja-sentiment")
tokenizer = AutoTokenizer.from_pretrained("Noanihio/arat5v2-darja-sentiment")

input_text = "sentiment: [Darja]: والله غير كي شفتو فرحت [Tashkīl]: وَاللَّهِ غَيْرُ كَيْ شَفْتُهُ فَرِحْتُ"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)
label = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(label)  # ➜ positive
```

## Training details

- Model: `UBC-NLP/AraT5v2-base-1024`
- Hardware: Google Colab Pro, T4 GPU
- Epochs: 3
- Batch size: 8
- Learning rate: 5e-5
- Framework: `transformers.Trainer`, full fine-tuning (no LoRA)

## Intended Use

This model is designed for:

- Automatic sentiment classification in Arabic dialects
- Evaluating emotional tone in Darja tweets and messages
- Research in NLP for underrepresented languages (Algerian Arabic)

## Limitations

- The model may be biased toward informal/digital Darja
- Limited generalization to other Arabic dialects
- Tashkīl input can improve results, but is optional

## Acknowledgements

Fine-tuned by @Noanihio with the help of Faiza Belbachir.
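As a minimal sketch of how the training input format described in the Dataset section can be assembled from the dataset fields (`build_input` is a hypothetical helper, not part of the released code):

```python
def build_input(tweet: str, text_catt: str) -> str:
    # Mirror the training prompt: task prefix, raw Darja text,
    # then the diacritized (Tashkīl) version of the same text.
    return f"sentiment: [Darja]: {tweet} [Tashkīl]: {text_catt}"

# Example row from the dataset schema (tweet, text_catt):
example = build_input(
    "والله غير كي شفتو فرحت",
    "وَاللَّهِ غَيْرُ كَيْ شَفْتُهُ فَرِحْتُ",
)
print(example)
```

The resulting string is what gets tokenized and fed to `model.generate`, as in the Usage example above.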