license: cc-by-nc-4.0

SentiV

A Benchmark for Low-Resource Vietnamese Speech–Text Emotion Understanding

This repository releases datasets, code, and pretrained checkpoints for SentiV, a benchmark for Vietnamese emotion understanding across text, speech, and multimodal settings, as described in our paper.

📄 Paper: SentiV: A Benchmark for Low-Resource Vietnamese Speech–Text Emotion Understanding


1. Overview

SentiV focuses on realistic low-resource evaluation for Vietnamese emotion recognition under:

  • Label imbalance
  • Limited supervision (1–100% label budgets)
  • Cross-dataset and cross-modal generalization
  • Explicit label-space alignment between text and speech

We release:

  • Text emotion dataset (data + code + checkpoints)
  • Speech emotion annotations (labels + code + checkpoints)
  • Reproducible training and evaluation scripts

2. Repository Structure

sentiv/
├── text-training/
│   ├── model/                # Text model checkpoints
│   ├── train_PhoBERT.py      # Training script (PhoBERT)
│   ├── train.xlsx            # Labeled text data
│   └── readme.MD
│
├── voice-training/
│   ├── hubert-large-ls960/   # Speech model checkpoints
│   ├── label/                # Emotion labels and split manifests
│   ├── train_hubert.py       # HuBERT fine-tuning script
│   └── readme.MD
│
└── README.md                 # This file

3. Tasks and Label Space

Task A: Text Emotion Classification

  • Labels (7): Anger, Disgust, Enjoyment, Fear, Neutral, Sadness, Surprise
  • Dataset: social media text (comments, posts)
  • Evaluation: Macro-F1, Accuracy
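
For readers unfamiliar with the two metrics, the sketch below shows how Macro-F1 and Accuracy can be computed from gold and predicted labels. It is an illustration only, not the evaluation code shipped in this repository; the function names `macro_f1` and `accuracy` are hypothetical, and the choice to score an absent class as 0 is an assumption.

```python
from collections import Counter

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 over `labels`."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    f1_scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # Assumption: a class with no gold and no predicted samples scores 0.
        f1_scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy usage with three of the seven text labels.
gold = ["Anger", "Anger", "Neutral", "Fear"]
pred = ["Anger", "Neutral", "Neutral", "Neutral"]
print(macro_f1(gold, pred, ["Anger", "Neutral", "Fear"]), accuracy(gold, pred))
```

Because every class contributes equally to the average, Macro-F1 penalizes a model that ignores rare emotions, which Accuracy alone would hide under label imbalance.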

Task B: Speech Emotion Classification

  • Labels (6): Anger (includes Disgust), Enjoyment, Fear, Neutral, Sadness, Surprise
  • Disgust is merged into Anger due to extreme scarcity in speech data

Task C: Multimodal Speech–Text Classification

  • Same 6-label space as speech
  • Late fusion over text and speech logits

4. Text Modality (text-training)

Data

  • Source: public Vietnamese social media posts
  • Size: 265,011 labeled samples
  • Average length: ~20 words
  • Labels: 7 emotions
  • Anonymized and released strictly for research use

Model

  • Backbone: PhoBERT (vinai/phobert-base)
  • Loss: Focal Loss with class reweighting
  • Max sequence length: 256
  • Metric: Macro-F1
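
To make the loss concrete, here is a minimal per-sample sketch of focal loss with class reweighting, written in plain Python rather than taken from `train_PhoBERT.py`. The focusing parameter `gamma=2.0` and the inverse-frequency weighting scheme are assumptions; this README does not state the values used in training.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def inverse_frequency_weights(class_counts):
    """Assumed scheme: weight = N / (K * count), so rare classes weigh more.
    Every count must be non-zero."""
    n, k = sum(class_counts), len(class_counts)
    return [n / (k * c) for c in class_counts]

def focal_loss(logits, target, class_weights, gamma=2.0):
    """Focal loss for one sample: -w_t * (1 - p_t)^gamma * log(p_t)."""
    p_t = softmax(logits)[target]
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)

# Toy usage: a two-class problem with a 90/10 imbalance.
weights = inverse_frequency_weights([90, 10])
print(focal_loss([2.0, 0.5], target=1, class_weights=weights))
```

With `gamma=0` and unit weights the expression reduces to ordinary cross-entropy; larger `gamma` shrinks the contribution of samples the model already classifies confidently.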

Training

From the text-training/ directory:

python train_PhoBERT.py

The script supports:

  • Class imbalance handling
  • Oversampling
  • Low-resource label budgets
  • Fixed train/dev/test splits

5. Speech Modality (voice-training)

Data

  • Source audio: VietSpeech dataset (batches 0–10)

  • We release:

    • Emotion labels
    • Split manifests
    • Training code
  • Raw audio must be obtained from the original VietSpeech source under its license

Label Mapping

  • Disgust is merged into Anger: it is too scarce in the speech data to train on reliably as a separate class
  • Final label space: 6 emotions
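
The mapping above can be sketched as follows. The label order is taken from the lists in Section 3 and is an assumption about index order, so check it against the released manifests and checkpoints. The helper `project_text_probs`, which folds 7-class text probabilities into the 6-class space by adding the Disgust mass to Anger, is a hypothetical illustration of how text outputs could be aligned for fusion, not a description of the released code.

```python
TEXT_LABELS = ["Anger", "Disgust", "Enjoyment", "Fear", "Neutral", "Sadness", "Surprise"]
SPEECH_LABELS = ["Anger", "Enjoyment", "Fear", "Neutral", "Sadness", "Surprise"]

def to_speech_label(label):
    """Map a 7-class label name to the 6-class speech label space."""
    return "Anger" if label == "Disgust" else label

def project_text_probs(text_probs):
    """Fold 7-class probabilities (TEXT_LABELS order) into SPEECH_LABELS order.
    Assumption: Disgust probability mass is simply added to Anger."""
    merged = dict.fromkeys(SPEECH_LABELS, 0.0)
    for label, p in zip(TEXT_LABELS, text_probs):
        merged[to_speech_label(label)] += p
    return [merged[label] for label in SPEECH_LABELS]

# Toy usage: Anger (0.1) and Disgust (0.2) collapse into one entry.
print(project_text_probs([0.1, 0.2, 0.3, 0.1, 0.1, 0.1, 0.1]))
```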

Model

  • Backbone: HuBERT Large (ls960)
  • Input: 16 kHz audio, max 8 seconds
  • Loss: Weighted Cross-Entropy
  • Sampler: WeightedRandomSampler
  • Metric: Macro-F1
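
As a rough illustration of the input limit and the sampler, the sketch below derives the 8-second cap in samples and computes per-sample weights of the kind usually passed to PyTorch's `WeightedRandomSampler`. Cropping from the start of the clip and using `1 / class_count` as the weight are assumptions; `train_hubert.py` may do either differently.

```python
from collections import Counter

SAMPLE_RATE = 16_000             # 16 kHz input
MAX_SAMPLES = 8 * SAMPLE_RATE    # 8 seconds = 128,000 samples

def crop_waveform(waveform):
    """Keep at most the first 8 seconds (assumption: crop from the start)."""
    return waveform[:MAX_SAMPLES]

def per_sample_weights(labels):
    """Weight each sample by 1 / (size of its class), so that every class
    has the same total sampling mass."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

weights = per_sample_weights(["Neutral", "Neutral", "Neutral", "Fear"])
print(weights)
# In a PyTorch pipeline these weights would typically be passed as:
#   torch.utils.data.WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
```

Sampling with replacement under these weights draws each emotion class about equally often per epoch, which complements the weighted cross-entropy loss.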

Training

From the voice-training/ directory:

python train_hubert.py

6. Multimodal Fusion

We adopt late fusion at the logit level for reproducibility.

Fusion Strategies

  • Average fusion
  • Concatenation + MLP
  • Uncertainty-aware late fusion (main method)

Confidence is estimated from entropy or max probability, and fusion weights are adjusted dynamically to down-weight unreliable modalities.
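
One way to realize the entropy-based variant is sketched below. It is a simplified illustration under stated assumptions, not the paper's exact formulation: confidence is taken as one minus the normalized entropy, weights are the normalized confidences, and both inputs are assumed to be logits over the same 6-label space. The function names and the `eps` smoothing term are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_confidence(probs):
    """1 - normalized entropy: near 1 for a peaked distribution, 0 for uniform."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return max(0.0, 1.0 - h / math.log(len(probs)))

def uncertainty_fusion(text_logits, speech_logits, eps=1e-8):
    """Weight each modality's logits by its relative confidence."""
    c_text = entropy_confidence(softmax(text_logits))
    c_speech = entropy_confidence(softmax(speech_logits))
    w_text = (c_text + eps) / (c_text + c_speech + 2 * eps)
    w_speech = 1.0 - w_text
    fused = [w_text * t + w_speech * s for t, s in zip(text_logits, speech_logits)]
    return fused, (w_text, w_speech)

# Toy usage: a confident text model and an uninformative speech model.
fused, (w_text, w_speech) = uncertainty_fusion([10.0, 0, 0, 0, 0, 0], [0.0] * 6)
print(w_text, w_speech)
```

When both modalities are equally confident the weights fall back to 0.5 each, which is the average-fusion baseline listed above.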


7. Low-Resource Evaluation Protocol

  • Label budgets: 1%, 5%, 10%, 25%, 50%, 100%
  • Fixed test set
  • Only training data is subsampled
  • 3–5 random seeds per setting
  • Report mean ± std

This protocol is designed to reflect realistic variance under limited supervision.
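
The protocol can be sketched as a seeded subsampling step followed by aggregation over seeds. Stratifying the subsample by class and keeping at least one example per class are assumptions made here for illustration; the README only states that the training data is subsampled. The names `stratified_subsample` and `summarize` are hypothetical.

```python
import random
import statistics
from collections import defaultdict

BUDGETS = [0.01, 0.05, 0.10, 0.25, 0.50, 1.00]

def stratified_subsample(labels, budget, seed):
    """Return sorted training indices covering `budget` of each class.
    Assumption: at least one example is kept per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    chosen = []
    for y in sorted(by_class):
        idxs = by_class[y]
        k = max(1, round(budget * len(idxs)))
        chosen.extend(rng.sample(idxs, k))
    return sorted(chosen)

def summarize(scores):
    """Mean and sample standard deviation over per-seed scores (needs >= 2 seeds)."""
    return statistics.mean(scores), statistics.stdev(scores)

# Toy usage: an imbalanced training set of 100 labels at a 10% budget.
train_labels = ["Neutral"] * 90 + ["Fear"] * 10
for seed in (0, 1, 2):
    subset = stratified_subsample(train_labels, budget=0.10, seed=seed)
    print(seed, len(subset))
```

Each seed produces a different training subset while the test set stays untouched, so the reported standard deviation reflects sampling variance in the labeled data as well as training noise.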


8. Ethics and Licensing

Text Data

  • Collected from publicly available social media
  • All user-identifying information removed
  • Research-only use
  • Takedown requests supported

Speech Data

  • Based on VietSpeech
  • Speakers provided research consent
  • We release labels and derived artifacts only

Users must comply with the licenses of the original datasets.


9. Access Policy

This repository is released via Hugging Face with access control enabled.

  • Users must request access
  • Access is granted manually for research purposes
  • Redistribution without permission is not allowed

10. Citation

If you use SentiV, please cite our paper:

@article{sentiv2026,
  title     = {SentiV: A Benchmark for Low-Resource Vietnamese Speech–Text Emotion Understanding},
  author    = {Anonymous},
  url       = {https://huggingface.co/ducdatit2002/sentiv/edit},
  year      = {2026}
}