
Alidiamond/somali-sentiment-analysis

📋 Model Description

Model Name: Alidiamond/somali-sentiment-analysis

This is a fine-tuned version of castorini/afriberta_base specifically optimized for Somali language sentiment analysis. The model can classify Somali text into positive or negative sentiment categories.

Base Model: castorini/afriberta_base - A multilingual African language model pretrained on 11 African languages including Somali.

Fine-tuned by: Alidiamond

🌍 Language Support

  • Primary Language: Somali (so)
  • Task: Binary Sentiment Classification
  • Classes: Positive / Negative
  • Script: Latin script

🏗️ Model Architecture

  • Base: AfriBERTa (XLM-RoBERTa architecture)
  • Parameters: ~111 million
  • Layers: 8
  • Attention Heads: 6
  • Hidden Size: 768
  • Feed Forward Size: 3072
  • Max Sequence Length: 512 tokens
  • Vocabulary Size: 70,006
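
As a rough cross-check of the figures above, the parameter count can be estimated from the listed dimensions. This is a back-of-the-envelope sketch, not an official breakdown: it assumes a standard BERT-style encoder layout (a bias on every linear layer, two LayerNorms per layer, learned position embeddings) and ignores the classification head. The function name is hypothetical.

```python
def estimate_encoder_params(vocab_size, hidden, layers, ffn, max_positions):
    """Approximate parameter count for a BERT-style encoder (assumed layout)."""
    # Word embeddings + learned position embeddings + embedding LayerNorm (scale, bias)
    embeddings = vocab_size * hidden + max_positions * hidden + 2 * hidden
    # Query, key, value and output projections, each with a bias vector
    attention = 4 * (hidden * hidden + hidden)
    # Two linear layers: hidden -> ffn and ffn -> hidden, with biases
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # Two LayerNorms per layer, each with scale and bias
    layer_norms = 2 * (2 * hidden)
    return embeddings + layers * (attention + feed_forward + layer_norms)


total = estimate_encoder_params(
    vocab_size=70_006, hidden=768, layers=8, ffn=3072, max_positions=512
)
head_dim = 768 // 6  # per-head dimension with 6 attention heads
print(f"~{total / 1e6:.1f}M parameters, {head_dim} dimensions per head")
```

Under these assumptions the estimate lands close to the ~111 million figure, and roughly half of it comes from the 70,006-entry embedding table.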

🎯 Intended Use Cases

  • Social Media Analysis: Analyze sentiment in Somali tweets, Facebook posts, and comments
  • Customer Feedback: Process customer reviews and feedback in Somali
  • News Sentiment: Understand public sentiment towards news articles
  • Market Research: Gauge public opinion on products, services, or events
  • Content Moderation: Identify potentially negative or harmful content
  • Academic Research: Study sentiment patterns in Somali text data

🚀 How to Use

Installation

pip install transformers torch

Basic Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned model and tokenizer
model_name = "Alidiamond/somali-sentiment-analysis"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example Somali texts
texts = [
    "Waan ku faraxsanahay adeeggan cusub",        # "I am happy with this new service" (Positive)
    "Waxan necebahay sida ay u dhaqmayaan",       # "I hate how they behave" (Negative)
    "Barnaamijkan aad buu u wanaagsan yahay",     # "This program is very good" (Positive)
]

def predict_sentiment(text):
    # Tokenize the input
    inputs = tokenizer(
        text, 
        return_tensors="pt", 
        truncation=True, 
        padding=True, 
        max_length=512
    )
    
    # Make prediction
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_class = torch.argmax(predictions, dim=-1).item()
        confidence = predictions[0][predicted_class].item()
    
    # Map prediction to label
    labels = {0: "Negative", 1: "Positive"}
    
    return {
        "text": text,
        "sentiment": labels[predicted_class],
        "confidence": confidence
    }

# Analyze sentiment
for text in texts:
    result = predict_sentiment(text)
    print(f"Text: {result['text']}")
    print(f"Sentiment: {result['sentiment']} (Confidence: {result['confidence']:.3f})")
    print("-" * 50)

Batch Processing

def analyze_batch(texts):
    """Process multiple texts at once for better efficiency"""
    inputs = tokenizer(
        texts, 
        return_tensors="pt", 
        truncation=True, 
        padding=True, 
        max_length=512
    )
    
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    
    results = []
    for i, text in enumerate(texts):
        predicted_class = torch.argmax(predictions[i]).item()
        confidence = predictions[i][predicted_class].item()
        sentiment = "Positive" if predicted_class == 1 else "Negative"
        
        results.append({
            "text": text,
            "sentiment": sentiment,
            "confidence": confidence
        })
    
    return results

# Example batch processing
batch_texts = [
    "Mahadsanid adeegga wanaagsan",
    "Waxaan codsaneynaa in la hagaajiyo",
    "Aad ayaan ugu faraxsan nahay natiijada"
]

batch_results = analyze_batch(batch_texts)
for result in batch_results:
    print(f"'{result['text']}' → {result['sentiment']} ({result['confidence']:.3f})")
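
`analyze_batch` sends every text through the model in a single forward pass, which can exhaust memory for large inputs. One option is to split the list into fixed-size chunks first. The sketch below shows only the chunking step; `iter_batches` and the batch size of 2 are illustrative assumptions, not part of this model's API.

```python
from typing import Iterator, List


def iter_batches(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = [
    "Mahadsanid adeegga wanaagsan",
    "Waxaan codsaneynaa in la hagaajiyo",
    "Aad ayaan ugu faraxsan nahay natiijada",
]

batches = list(iter_batches(texts, batch_size=2))
print([len(batch) for batch in batches])
```

Each chunk can then be passed to `analyze_batch` and the returned lists concatenated.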

📊 Model Performance

Training Details

  • Base Model: castorini/afriberta_base
  • Fine-tuning Task: Binary Sentiment Classification
  • Framework: Transformers
  • Optimization: Fine-tuned specifically for Somali sentiment patterns

Output Format

  • Label 0: Negative sentiment
  • Label 1: Positive sentiment
  • Confidence: Probability score (0.0 - 1.0)
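
The confidence value is the softmax probability of the predicted class. As a minimal, dependency-free sketch of that mapping (the logit values are made up for illustration, and `logits_to_prediction` is a hypothetical helper):

```python
import math

LABELS = {0: "Negative", 1: "Positive"}  # label mapping from this model card


def logits_to_prediction(logits):
    """Convert raw class logits into a (label, confidence) pair via softmax."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [value / total for value in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


label, confidence = logits_to_prediction([-1.0, 2.0])  # illustrative logits
print(label, round(confidence, 3))
```

With only two classes the confidence is never below 0.5, so values near 0.5 indicate that the model is close to undecided.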

🔧 Technical Specifications

Model Files

  • config.json - Model configuration
  • model.safetensors - Model weights (SafeTensors format)
  • tokenizer_config.json - Tokenizer configuration
  • sentencepiece.bpe.model - SentencePiece model
  • special_tokens_map.json - Special tokens mapping

Special Tokens

  • <s> - Beginning of sequence
  • </s> - End of sequence
  • <pad> - Padding token
  • <unk> - Unknown token
  • <mask> - Mask token

📝 Input Requirements

  • Format: Plain text in Somali language
  • Encoding: UTF-8
  • Max Length: 512 tokens (longer texts will be truncated)
  • Preprocessing: None required; tokenization and normalization are handled by the tokenizer
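
Truncation silently drops everything past the limit. If the tail of a long text matters, one alternative is to split the token sequence into overlapping windows and classify each window separately. The sketch below operates on a plain list of token IDs and reserves two positions per window for the `<s>` and `</s>` tokens. The function name, the overlap of 50 tokens and the dummy IDs are all assumptions for illustration.

```python
def split_into_windows(token_ids, max_length=512, overlap=50, num_special=2):
    """Split token IDs into overlapping windows that fit the length limit.

    `num_special` positions are reserved per window for <s> and </s>.
    """
    window = max_length - num_special
    if overlap >= window:
        raise ValueError("overlap must be smaller than the usable window size")
    step = window - overlap
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the last window already reaches the end of the text
    return windows


dummy_ids = list(range(1200))  # stand-in for a tokenized long document
windows = split_into_windows(dummy_ids)
print([len(w) for w in windows])
```

How to combine the per-window predictions (for example, by averaging probabilities) is a design choice this model card does not specify.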

💡 Best Practices

  1. Text Quality: Use clean, well-formed Somali text for best results
  2. Context: Provide sufficient context for accurate sentiment detection
  3. Length: Keep texts under 512 tokens to avoid truncation
  4. Mixed Languages: Model works best with pure Somali text
  5. Batch Processing: Use batch processing for multiple texts to improve efficiency

⚠️ Limitations

  • Language Specific: Optimized for Somali language only
  • Binary Classification: Only distinguishes between positive and negative (no neutral class)
  • Context Dependent: Performance may vary with sarcasm or complex linguistic patterns
  • Domain Specific: May require additional fine-tuning for specific domains
  • Dialectal Variations: Trained on standard Somali; performance may vary with regional dialects
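
Because the model has no neutral class, every input is forced into one of the two labels. A common workaround, sketched below, is to treat low-confidence predictions as undecided. The threshold of 0.6 and the "Uncertain" label are illustrative assumptions, not calibrated values for this model; the input dictionaries follow the shape returned by `predict_sentiment`.

```python
def apply_confidence_threshold(result, threshold=0.6):
    """Return a copy of `result` relabeled as "Uncertain" below the threshold."""
    if not 0.5 <= threshold <= 1.0:
        raise ValueError("threshold should lie between 0.5 and 1.0 for two classes")
    adjusted = dict(result)
    if adjusted["confidence"] < threshold:
        adjusted["sentiment"] = "Uncertain"
    return adjusted


# Hypothetical predictions, for illustration only
confident = {"text": "example A", "sentiment": "Positive", "confidence": 0.91}
borderline = {"text": "example B", "sentiment": "Negative", "confidence": 0.54}

print(apply_confidence_threshold(confident)["sentiment"])
print(apply_confidence_threshold(borderline)["sentiment"])
```

A suitable threshold would need to be chosen on held-out labeled data for the target domain.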

🔗 Base Model Information

This model is built upon castorini/afriberta_base, which was pretrained on 11 African languages:

  • Afaan Oromoo (Oromo)
  • Amharic
  • Gahuza (a mixed language covering Kinyarwanda and Kirundi, counted as two languages)
  • Hausa
  • Igbo
  • Nigerian Pidgin
  • Somali
  • Swahili
  • Tigrinya
  • Yorùbá

🤝 Contributing

Contributions to improve the model's performance on Somali sentiment analysis are welcome. Areas for improvement:

  • Additional training data
  • Domain-specific fine-tuning
  • Multi-class sentiment classification
  • Regional dialect support

👨‍💻 Model Author

Created by: Alidiamond
Model: Alidiamond/somali-sentiment-analysis
Base Model: castorini/afriberta_base
Language: Somali (so)
Task: Sentiment Analysis (Binary Classification)

📄 License

Please refer to the original castorini/afriberta_base license terms for usage restrictions and requirements.

📚 Citation

If you use this model in your research, please cite both this work and the original AfriBERTa paper:

@misc{alidiamond2024somali,
    title={Somali Sentiment Analysis using Fine-tuned AfriBERTa},
    author={Alidiamond},
    year={2024},
    howpublished={\url{https://huggingface.co/Alidiamond/somali-sentiment-analysis}}
}

@article{afriberta,
    title={Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages},
    author={Kelechi Ogueji and Yuxin Zhu and Jimmy Lin},
    year={2021},
    journal={arXiv preprint arXiv:2104.02516}
}

🔍 Example Outputs

Input: "Waxaan aad ugu faraxsanahay waxan arkay"
Output: Positive (Confidence: 0.892)

Input: "Ma jecelahay waxa dhacaya"
Output: Negative (Confidence: 0.756)

Input: "Barnaamijkan waa mid aad u wanaagsan"
Output: Positive (Confidence: 0.834)
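
For use cases such as market research, individual predictions are usually aggregated rather than read one by one. A minimal sketch using the three example outputs above as input (the `summarize` helper is hypothetical):

```python
from collections import Counter


def summarize(results):
    """Count labels and compute the mean confidence over a list of predictions."""
    counts = Counter(item["sentiment"] for item in results)
    mean_confidence = sum(item["confidence"] for item in results) / len(results)
    return {
        "counts": dict(counts),
        "positive_share": counts["Positive"] / len(results),
        "mean_confidence": mean_confidence,
    }


# The three example outputs listed above
examples = [
    {"sentiment": "Positive", "confidence": 0.892},
    {"sentiment": "Negative", "confidence": 0.756},
    {"sentiment": "Positive", "confidence": 0.834},
]

summary = summarize(examples)
print(summary["counts"], round(summary["mean_confidence"], 3))
```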

🚀 Quick Start

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load model directly from Hugging Face
tokenizer = AutoTokenizer.from_pretrained("Alidiamond/somali-sentiment-analysis")
model = AutoModelForSequenceClassification.from_pretrained("Alidiamond/somali-sentiment-analysis")

# Test with Somali text
text = "Waan ku faraxsanahay adeeggan cusub"  # "I am happy with this new service"
# The tokenizer returns a dict of tensors, so unpack it into keyword arguments
result = model(**tokenizer(text, return_tensors="pt"))

Note: This model (Alidiamond/somali-sentiment-analysis) is designed specifically for Somali sentiment analysis. For other African languages, consider using the original castorini/afriberta_base model or fine-tuning it for your language and task.
