Alidiamond/somali-sentiment-analysis
Model Description
Model Name: Alidiamond/somali-sentiment-analysis
This is a fine-tuned version of castorini/afriberta_base specifically optimized for Somali language sentiment analysis. The model can classify Somali text into positive or negative sentiment categories.
Base Model: castorini/afriberta_base - A multilingual African language model pretrained on 11 African languages including Somali.
Fine-tuned by: Alidiamond
Language Support
- Primary Language: Somali (so)
- Task: Binary Sentiment Classification
- Classes: Positive / Negative
- Script: Latin script
Model Architecture
- Base: AfriBERTa (RoBERTa-based)
- Parameters: ~111 million
- Layers: 8
- Attention Heads: 6
- Hidden Size: 768
- Feed Forward Size: 3072
- Max Sequence Length: 512 tokens
- Vocabulary Size: 70,006
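As a sanity check, the figures in this list can be combined into a rough parameter count. The sketch below uses the standard parameter formulas for a RoBERTa-style encoder. The values `max_positions=514` and `type_vocab=1` are assumptions based on common RoBERTa conventions and are not stated in this card. The classification head is not counted.

```python
def estimate_encoder_params(vocab_size=70_006, hidden=768, layers=8,
                            ffn=3072, max_positions=514, type_vocab=1):
    """Rough parameter count for a RoBERTa-style encoder (no task head).

    max_positions and type_vocab are assumed values, not taken from the card.
    """
    layer_norm = 2 * hidden  # weight + bias
    # Word, position and token-type embeddings plus one LayerNorm
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, each with a bias
    attention = 4 * (hidden * hidden + hidden)
    # Two linear layers: hidden -> ffn and ffn -> hidden
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    return embeddings + layers * per_layer

total = estimate_encoder_params()
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the estimate lands close to the ~111 million listed above. The number of attention heads does not change the count, since the heads split the same projection matrices.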
Intended Use Cases
- Social Media Analysis: Analyze sentiment in Somali tweets, Facebook posts, and comments
- Customer Feedback: Process customer reviews and feedback in Somali
- News Sentiment: Understand public sentiment towards news articles
- Market Research: Gauge public opinion on products, services, or events
- Content Moderation: Identify potentially negative or harmful content
- Academic Research: Study sentiment patterns in Somali text data
How to Use
Installation
pip install transformers torch
Basic Usage
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned model and tokenizer
model_name = "Alidiamond/somali-sentiment-analysis"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example Somali texts
texts = [
    "Waan ku faraxsanahay adeeggan cusub",  # "I am happy with this new service" (Positive)
    "Waxan necebahay sida ay u dhaqmayaan",  # "I hate how they behave" (Negative)
    "Barnaamijkan aad buu u wanaagsan yahay",  # "This program is very good" (Positive)
]

def predict_sentiment(text):
    # Tokenize the input
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=512
    )
    # Make prediction
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=-1).item()
    confidence = predictions[0][predicted_class].item()
    # Map prediction to label
    labels = {0: "Negative", 1: "Positive"}
    return {
        "text": text,
        "sentiment": labels[predicted_class],
        "confidence": confidence
    }

# Analyze sentiment
for text in texts:
    result = predict_sentiment(text)
    print(f"Text: {result['text']}")
    print(f"Sentiment: {result['sentiment']} (Confidence: {result['confidence']:.3f})")
    print("-" * 50)
Batch Processing
def analyze_batch(texts):
    """Process multiple texts at once for better efficiency"""
    inputs = tokenizer(
        texts,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=512
    )
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    results = []
    for i, text in enumerate(texts):
        predicted_class = torch.argmax(predictions[i]).item()
        confidence = predictions[i][predicted_class].item()
        sentiment = "Positive" if predicted_class == 1 else "Negative"
        results.append({
            "text": text,
            "sentiment": sentiment,
            "confidence": confidence
        })
    return results

# Example batch processing
batch_texts = [
    "Mahadsanid adeegga wanaagsan",
    "Waxaan codsaneynaa in la hagaajiyo",
    "Aad ayaan ugu faraxsan nahay natiijada"
]
batch_results = analyze_batch(batch_texts)
for result in batch_results:
    print(f"'{result['text']}' → {result['sentiment']} ({result['confidence']:.3f})")
Model Performance
Training Details
- Base Model: castorini/afriberta_base
- Fine-tuning Task: Binary Sentiment Classification
- Framework: Transformers
- Optimization: Fine-tuned specifically for Somali sentiment patterns
Output Format
- Label 0: Negative sentiment
- Label 1: Positive sentiment
- Confidence: Probability score (0.0 - 1.0)
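To make the output format concrete, here is a minimal, dependency-free sketch of how two raw logits become a label and a confidence score via softmax. The function name `logits_to_prediction` and the example logits are hypothetical and used only for illustration. The label mapping is the one listed above.

```python
import math

# Label mapping as stated in the Output Format list
LABELS = {0: "Negative", 1: "Positive"}

def logits_to_prediction(logits):
    """Turn raw class logits into a label and a softmax confidence."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Index of the highest probability (first index wins on ties)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"sentiment": LABELS[best], "confidence": probs[best]}

# Made-up logits, not real model output
example = logits_to_prediction([-1.2, 2.3])
print(example)
```

With two classes the confidence of the predicted label always lies between 0.5 and 1.0, so values near 0.5 indicate that the model barely prefers one class over the other.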
Technical Specifications
Model Files
- config.json: Model configuration
- model.safetensors: Model weights (SafeTensors format)
- tokenizer_config.json: Tokenizer configuration
- sentencepiece.bpe.model: SentencePiece model
- special_tokens_map.json: Special tokens mapping
Special Tokens
- <s>: Beginning of sequence
- </s>: End of sequence
- <pad>: Padding token
- <unk>: Unknown token
- <mask>: Mask token
Input Requirements
- Format: Plain text in Somali language
- Encoding: UTF-8
- Max Length: 512 tokens (longer texts will be truncated)
- Preprocessing: Automatic tokenization and normalization
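Because anything beyond 512 tokens is cut off, long documents lose their tail. One possible workaround, sketched here as an assumption rather than something this model card prescribes, is to split the token ids into overlapping windows and classify each window separately. The helper `split_into_windows` and its `stride` default are hypothetical. Two positions are reserved for the `<s>` and `</s>` tokens.

```python
def split_into_windows(token_ids, max_length=512, special_tokens=2, stride=64):
    """Split token ids into overlapping windows that fit the length limit.

    `stride` is the number of tokens shared between neighbouring windows.
    """
    window = max_length - special_tokens  # leave room for <s> and </s>
    if not 0 <= stride < window:
        raise ValueError("stride must be non-negative and smaller than the window")
    if len(token_ids) <= window:
        return [list(token_ids)]
    step = window - stride
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(list(token_ids[start:start + window]))
        if start + window >= len(token_ids):
            break  # the last window already reaches the end
    return windows

# Stand-in token ids; a real run would take them from the tokenizer
chunks = split_into_windows(list(range(1200)))
print([len(c) for c in chunks])
```

How to combine the per-window predictions (for example averaging the probabilities) is a separate design choice that this card does not specify. The Transformers tokenizers also accept `return_overflowing_tokens=True` together with `stride`, which may be a more direct route than manual splitting.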
Best Practices
- Text Quality: Use clean, well-formed Somali text for best results
- Context: Provide sufficient context for accurate sentiment detection
- Length: Keep texts under 512 tokens to avoid truncation
- Mixed Languages: Model works best with pure Somali text
- Batch Processing: Use batch processing for multiple texts to improve efficiency
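For large collections, passing every text to the model in one call can exhaust memory. A minimal sketch of fixed-size batching follows. The helper `iter_batches` and the default `batch_size=16` are illustrative assumptions, not values recommended by the model author.

```python
def iter_batches(texts, batch_size=16):
    """Yield consecutive slices of `texts` with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]

# Small demonstration with placeholder strings
demo = list(iter_batches(["a", "b", "c", "d", "e"], batch_size=2))
print(demo)
```

Each yielded slice can then be handed to `analyze_batch` from the Batch Processing section, extending a single results list as the loop proceeds.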
Limitations
- Language Specific: Optimized for Somali language only
- Binary Classification: Only distinguishes between positive and negative (no neutral class)
- Context Dependent: Performance may vary with sarcasm or complex linguistic patterns
- Domain Specific: May require additional fine-tuning for specific domains
- Dialectal Variations: Trained on standard Somali; performance may vary with regional dialects
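Since there is no neutral class, every input is forced into Positive or Negative. One way an application might soften this, shown here purely as a sketch, is to flag low-confidence predictions instead of trusting them. The function `apply_confidence_threshold`, the "Uncertain" label, and the 0.7 cutoff are all hypothetical. Low confidence is not the same thing as neutral sentiment.

```python
def apply_confidence_threshold(result, threshold=0.7):
    """Return a copy of `result` relabeled "Uncertain" when confidence is low.

    `result` is a dict with "sentiment" and "confidence" keys, as returned by
    predict_sentiment. The threshold value is an arbitrary example.
    """
    if not 0.5 <= threshold <= 1.0:
        raise ValueError("threshold should lie between 0.5 and 1.0 for two classes")
    flagged = dict(result)  # copy so the caller's dict is left untouched
    if flagged["confidence"] < threshold:
        flagged["sentiment"] = "Uncertain"
    return flagged

# Made-up prediction for illustration
sample = {"text": "example", "sentiment": "Negative", "confidence": 0.55}
print(apply_confidence_threshold(sample))
```

A suitable cutoff would need to be chosen on held-out Somali data for the target domain.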
Base Model Information
This model is built upon castorini/afriberta_base, which was pretrained on 11 African languages:
- Afaan Oromoo (Oromo)
- Amharic
- Gahuza (Kinyarwanda and Kirundi)
- Hausa
- Igbo
- Nigerian Pidgin
- Somali
- Swahili
- Tigrinya
- Yorùbá
Contributing
Contributions to improve the model's performance on Somali sentiment analysis are welcome. Areas for improvement:
- Additional training data
- Domain-specific fine-tuning
- Multi-class sentiment classification
- Regional dialect support
Model Author
Created by: Alidiamond
Model: Alidiamond/somali-sentiment-analysis
Base Model: castorini/afriberta_base
Language: Somali (so)
Task: Sentiment Analysis (Binary Classification)
License
Please refer to the original castorini/afriberta_base license terms for usage restrictions and requirements.
Citation
If you use this model in your research, please cite both this work and the original AfriBERTa paper:
@misc{alidiamond2024somali,
title={Somali Sentiment Analysis using Fine-tuned AfriBERTa},
author={Alidiamond},
year={2024},
howpublished={\url{https://huggingface.co/Alidiamond/somali-sentiment-analysis}}
}
@article{afriberta,
title={AfriBERTa: Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages},
author={Kelechi Ogueji and Yuxin Zhu and Jimmy Lin},
year={2021},
journal={arXiv preprint arXiv:2104.02516}
}
Example Outputs
Input: "Waxaan aad ugu faraxsanahay waxan arkay"
Output: Positive (Confidence: 0.892)
Input: "Ma jecelahay waxa dhacaya"
Output: Negative (Confidence: 0.756)
Input: "Barnaamijkan waa mid aad u wanaagsan"
Output: Positive (Confidence: 0.834)
Quick Start
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Load model directly from Hugging Face
tokenizer = AutoTokenizer.from_pretrained("Alidiamond/somali-sentiment-analysis")
model = AutoModelForSequenceClassification.from_pretrained("Alidiamond/somali-sentiment-analysis")
# Test with Somali text
text = "Waan ku faraxsanahay adeeggan cusub"  # "I am happy with this new service"
inputs = tokenizer(text, return_tensors="pt")
# The tokenizer returns a dict of tensors, so unpack it into keyword arguments
result = model(**inputs)
Note: This model (Alidiamond/somali-sentiment-analysis) is designed specifically for Somali sentiment analysis. For other African languages, consider using the original castorini/afriberta_base model or fine-tuning it for your language and task.