
Alidiamond/somali-sentiment-analysis

📋 Model Description

Model Name: Alidiamond/somali-sentiment-analysis

This is a fine-tuned version of castorini/afriberta_base specifically optimized for Somali language sentiment analysis. The model can classify Somali text into positive or negative sentiment categories.

Base Model: castorini/afriberta_base - A multilingual African language model pretrained on 11 African languages including Somali.

Fine-tuned by: Alidiamond

🌍 Language Support

Primary Language: Somali (so)
Task: Binary Sentiment Classification
Classes: Positive / Negative
Script: Latin script

🏗️ Model Architecture

Base: AfriBERTa (RoBERTa-based)
Parameters: ~111 million
Layers: 8
Attention Heads: 6
Hidden Size: 768
Feed Forward Size: 3072
Max Sequence Length: 512 tokens
Vocabulary Size: 70,006
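
The parameter figure can be sanity-checked from the dimensions listed above. The following is a back-of-the-envelope sketch, assuming a standard RoBERTa-style layout with 514 position embeddings, one token-type embedding, bias terms on every linear layer, and a dense-plus-output classification head. None of these details are stated in this card, and `estimate_parameters` is a hypothetical helper, so treat the result as an estimate only.

```python
# Rough parameter count for a RoBERTa-style encoder with a classification head.
# Assumptions (not stated in this card): 514 position embeddings, 1 token-type
# embedding, biases on all linear layers, and a dense + output projection head.

def estimate_parameters(vocab=70_006, hidden=768, ffn=3_072, layers=8,
                        positions=514, num_labels=2):
    layer_norm = 2 * hidden  # scale and shift vectors
    # Token, position, and token-type embeddings plus the embedding LayerNorm.
    embeddings = vocab * hidden + positions * hidden + hidden + layer_norm
    # Query, key, value, and output projections plus the attention LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two feed-forward projections plus the output LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    # Classification head: dense layer followed by the label projection.
    head = (hidden * hidden + hidden) + (hidden * num_labels + num_labels)
    return embeddings + layers * (attention + feed_forward) + head

total = estimate_parameters()
print(f"Estimated parameters: {total / 1e6:.1f}M")
```

The number of attention heads does not appear in the calculation because splitting the hidden size across heads leaves the projection matrices the same size. Roughly half of the total sits in the token embedding table, which is typical for a small encoder with a large multilingual vocabulary.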

🎯 Intended Use Cases

Social Media Analysis: Analyze sentiment in Somali tweets, Facebook posts, and comments
Customer Feedback: Process customer reviews and feedback in Somali
News Sentiment: Understand public sentiment towards news articles
Market Research: Gauge public opinion on products, services, or events
Content Moderation: Identify potentially negative or harmful content
Academic Research: Study sentiment patterns in Somali text data

🚀 How to Use

Installation

pip install transformers torch

Basic Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned model and tokenizer
model_name = "Alidiamond/somali-sentiment-analysis"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example Somali texts
texts = [
    "Waan ku faraxsanahay adeeggan cusub",        # "I am happy with this new service" (Positive)
    "Waxan necebahay sida ay u dhaqmayaan",       # "I hate how they behave" (Negative)
    "Barnaamijkan aad buu u wanaagsan yahay",     # "This program is very good" (Positive)
]

def predict_sentiment(text):
    # Tokenize the input
    inputs = tokenizer(
        text, 
        return_tensors="pt", 
        truncation=True, 
        padding=True, 
        max_length=512
    )
    
    # Make prediction
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_class = torch.argmax(predictions, dim=-1).item()
        confidence = predictions[0][predicted_class].item()
    
    # Map prediction to label
    labels = {0: "Negative", 1: "Positive"}
    
    return {
        "text": text,
        "sentiment": labels[predicted_class],
        "confidence": confidence
    }

# Analyze sentiment
for text in texts:
    result = predict_sentiment(text)
    print(f"Text: {result['text']}")
    print(f"Sentiment: {result['sentiment']} (Confidence: {result['confidence']:.3f})")
    print("-" * 50)

Batch Processing

def analyze_batch(texts):
    """Process multiple texts at once for better efficiency"""
    inputs = tokenizer(
        texts, 
        return_tensors="pt", 
        truncation=True, 
        padding=True, 
        max_length=512
    )
    
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    
    results = []
    for i, text in enumerate(texts):
        predicted_class = torch.argmax(predictions[i]).item()
        confidence = predictions[i][predicted_class].item()
        sentiment = "Positive" if predicted_class == 1 else "Negative"
        
        results.append({
            "text": text,
            "sentiment": sentiment,
            "confidence": confidence
        })
    
    return results

# Example batch processing
batch_texts = [
    "Mahadsanid adeegga wanaagsan",
    "Waxaan codsaneynaa in la hagaajiyo",
    "Aad ayaan ugu faraxsan nahay natiijada"
]

batch_results = analyze_batch(batch_texts)
for result in batch_results:
    print(f"'{result['text']}' → {result['sentiment']} ({result['confidence']:.3f})")
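
For use cases such as social media or customer feedback analysis, per-text results are usually rolled up into an overall picture. A minimal sketch, assuming a list of dicts shaped like the output of analyze_batch above; `summarize` is a hypothetical helper and not part of this model or of Transformers.

```python
from collections import Counter

def summarize(results):
    """Aggregate per-text predictions into label shares and mean confidence."""
    total = len(results)
    counts = Counter(r["sentiment"] for r in results)
    mean_confidence = (
        sum(r["confidence"] for r in results) / total if total else 0.0
    )
    return {
        "total": total,
        "positive_share": counts["Positive"] / total if total else 0.0,
        "negative_share": counts["Negative"] / total if total else 0.0,
        "mean_confidence": mean_confidence,
    }

# Made-up predictions, for illustration only.
sample = [
    {"text": "a", "sentiment": "Positive", "confidence": 0.9},
    {"text": "b", "sentiment": "Negative", "confidence": 0.7},
]
print(summarize(sample))
```

Reporting the mean confidence alongside the shares helps flag batches where the model was unsure about most inputs.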

📊 Model Performance

Training Details

Base Model: castorini/afriberta_base
Fine-tuning Task: Binary Sentiment Classification
Framework: Transformers
Optimization: Fine-tuned specifically for Somali sentiment patterns

Output Format

Label 0: Negative sentiment
Label 1: Positive sentiment
Confidence: Probability score (0.0 - 1.0)
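
The confidence value is the softmax probability of the predicted label. The sketch below reproduces that step in plain Python so the mapping from raw logits to a label and score is explicit. The logits are made-up numbers, and `logits_to_prediction` is a hypothetical helper written for illustration.

```python
import math

LABELS = {0: "Negative", 1: "Positive"}  # mapping as documented above

def logits_to_prediction(logits):
    """Convert raw logits into a (label, confidence) pair via softmax."""
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Index of the highest probability; ties resolve to the lower index.
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

label, confidence = logits_to_prediction([-1.2, 2.3])  # made-up logits
print(label, round(confidence, 3))
```

With two classes the confidence can never fall below 0.5, so a score near 0.5 means the model has no real preference between the labels.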

🔧 Technical Specifications

Model Files

config.json - Model configuration
model.safetensors - Model weights (SafeTensors format)
tokenizer_config.json - Tokenizer configuration
sentencepiece.bpe.model - SentencePiece model
special_tokens_map.json - Special tokens mapping

Special Tokens

<s> - Beginning of sequence
</s> - End of sequence
<pad> - Padding token
<unk> - Unknown token
<mask> - Mask token

📝 Input Requirements

Format: Plain text in Somali language
Encoding: UTF-8
Max Length: 512 tokens (longer texts will be truncated)
Preprocessing: Automatic tokenization and normalization

💡 Best Practices

Text Quality: Use clean, well-formed Somali text for best results
Context: Provide sufficient context for accurate sentiment detection
Length: Keep texts under 512 tokens to avoid truncation
Mixed Languages: Model works best with pure Somali text
Batch Processing: Use batch processing for multiple texts to improve efficiency
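
When the number of texts is large, passing everything to the tokenizer in one call can exhaust memory, so it helps to process fixed-size slices. A minimal sketch, assuming a function with the same interface as analyze_batch above (a list of texts in, a list of results out); `iter_batches`, `analyze_in_batches`, and the default batch size of 16 are illustrative choices rather than recommendations from the model author.

```python
def iter_batches(items, batch_size=16):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def analyze_in_batches(texts, analyze_fn, batch_size=16):
    """Apply `analyze_fn` to each slice and concatenate the results in order."""
    results = []
    for batch in iter_batches(texts, batch_size):
        results.extend(analyze_fn(batch))
    return results

# Intended use with the function defined earlier in this card:
#   results = analyze_in_batches(all_texts, analyze_batch, batch_size=8)
```

Because padding extends every text to the longest one in its slice, grouping texts of similar length into the same slice can reduce wasted computation.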

⚠️ Limitations

Language Specific: Optimized for Somali language only
Binary Classification: Only distinguishes between positive and negative (no neutral class)
Context Dependent: Performance may vary with sarcasm or complex linguistic patterns
Domain Specific: May require additional fine-tuning for specific domains
Dialectal Variations: Trained on standard Somali; performance may vary with regional dialects
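
Because there is no neutral class, neutral or ambiguous text is still forced into one of the two labels. One common workaround is to treat low-confidence predictions as undecided. The sketch below assumes a result dict shaped like the output of predict_sentiment; `with_uncertain_band` is a hypothetical helper, and the 0.7 threshold is an arbitrary placeholder that would need tuning on labeled data.

```python
def with_uncertain_band(result, threshold=0.7):
    """Relabel a low-confidence prediction as 'Uncertain'.

    `result` needs 'sentiment' and 'confidence' keys. The default threshold
    is a placeholder, not a value validated for this model.
    """
    if not 0.5 <= threshold <= 1.0:
        raise ValueError("threshold should lie in [0.5, 1.0] for a binary model")
    if result["confidence"] < threshold:
        # Return a copy so the caller's original dict is left unchanged.
        return {**result, "sentiment": "Uncertain"}
    return dict(result)

# Made-up prediction, for illustration only.
print(with_uncertain_band({"text": "...", "sentiment": "Negative", "confidence": 0.58}))
```

An "Uncertain" label only says the model was not confident. It is not evidence that the text is neutral, so a true neutral class would still require retraining with three labels.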

🔗 Base Model Information

This model is built upon castorini/afriberta_base, which was pretrained on 11 African languages (Gahuza covers two of them, Kinyarwanda and Kirundi):

Afaan Oromoo (Oromo)
Amharic
Gahuza (Kinyarwanda and Kirundi)
Hausa
Igbo
Nigerian Pidgin
Somali
Swahili
Tigrinya
Yorùbá

🤝 Contributing

Contributions to improve the model's performance on Somali sentiment analysis are welcome. Areas for improvement:

Additional training data
Domain-specific fine-tuning
Multi-class sentiment classification
Regional dialect support

👨‍💻 Model Author

Created by: Alidiamond
Model: Alidiamond/somali-sentiment-analysis
Base Model: castorini/afriberta_base
Language: Somali (so)
Task: Sentiment Analysis (Binary Classification)

📄 License

Please refer to the original castorini/afriberta_base license terms for usage restrictions and requirements.

📚 Citation

If you use this model in your research, please cite both this work and the original AfriBERTa paper:

@misc{alidiamond2024somali,
    title={Somali Sentiment Analysis using Fine-tuned AfriBERTa},
    author={Alidiamond},
    year={2024},
    howpublished={\url{https://huggingface.co/Alidiamond/somali-sentiment-analysis}}
}

@inproceedings{afriberta,
    title={Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages},
    author={Kelechi Ogueji and Yuxin Zhu and Jimmy Lin},
    booktitle={Proceedings of the 1st Workshop on Multilingual Representation Learning},
    year={2021},
    publisher={Association for Computational Linguistics}
}

🔍 Example Outputs

Input: "Waxaan aad ugu faraxsanahay waxan arkay"
Output: Positive (Confidence: 0.892)

Input: "Ma jecelahay waxa dhacaya"
Output: Negative (Confidence: 0.756)

Input: "Barnaamijkan waa mid aad u wanaagsan"
Output: Positive (Confidence: 0.834)

🚀 Quick Start

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load model directly from Hugging Face
tokenizer = AutoTokenizer.from_pretrained("Alidiamond/somali-sentiment-analysis")
model = AutoModelForSequenceClassification.from_pretrained("Alidiamond/somali-sentiment-analysis")

# Test with Somali text
text = "Waan ku faraxsanahay adeeggan cusub"  # "I am happy with this new service"
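inputs = tokenizer(text, return_tensors="pt")
result = model(**inputs)  # unpack input_ids and attention_mask as keyword arguments
predicted_class = result.logits.argmax(dim=-1).item()  # 0 = Negative, 1 = Positive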

Note: This model (Alidiamond/somali-sentiment-analysis) is designed specifically for Somali sentiment analysis. For other African languages, consider using the original castorini/afriberta_base model or fine-tuning it for your language and task.
