Speech-to-Emoji Classifier

This repository releases a lightweight speech-to-emoji classifier used for Emoji-TTS evaluation. It predicts one label out of 11 classes: 10 emojis plus a "<none>" class.

Model

  • Base speech encoder: emotion2vec/emotion2vec_base
  • Feature type: utterance-level embedding
  • Classifier head: 3-layer MLP
  • Input dimension: 768
  • Hidden dimension: 512
  • Dropout: 0.2
  • Labels: 11
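
The head described above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions: the class name EmojiMLPHead, the ReLU activations, and the placement of dropout after each hidden layer are not documented here, and "3-layer" is read as three linear layers (768 -> 512 -> 512 -> 11). The parameter names in the released checkpoint may differ.

```python
import torch
from torch import nn


class EmojiMLPHead(nn.Module):
    """Hypothetical 3-layer MLP head; activation and dropout placement are assumptions."""

    def __init__(self, input_dim=768, hidden_dim=512, num_labels=11, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_labels),
        )

    def forward(self, embeddings):
        # embeddings: (batch, input_dim) utterance-level vectors
        # returns:    (batch, num_labels) unnormalized logits
        return self.net(embeddings)


head = EmojiMLPHead().eval()  # eval() disables dropout for inference
with torch.no_grad():
    # Random tensors stand in for real emotion2vec utterance embeddings.
    logits = head(torch.randn(4, 768))
print(logits.shape)
```

If this layout does not match the checkpoint, loading the state dict with strict key checking will report the mismatched parameter names, which is a quick way to recover the actual structure.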

The released best_classifier.pt contains the trained MLP head and label metadata. To run inference, you also need the upstream emotion2vec/emotion2vec_base model to extract utterance embeddings.

Training Data

The model was developed on 21,940 English TTS utterances built from 22,000 source texts; this total covers the train, dev, and test splits listed below:

  • 20,000 source texts carry 1 of 10 emoji labels, with 2,000 texts per emoji
  • 2,000 source texts are used as the "<none>" class
  • Each text is synthesized with Gemini TTS (flash or pro)
  • After merging available TTS outputs and filtering missing audio, 21,940 utterances remain overall
  • 19,945 retained utterances are emoji-labeled
  • 1,995 retained utterances are "<none>"

Balanced split:

  • Train: 19,940
  • Dev: 1,000
  • Test: 1,000
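
The counts above can be cross-checked with a few lines of arithmetic. The snippet below only restates the numbers from this section; the variable names are illustrative.

```python
# Counts copied from the Training Data section.
source_texts = {"emoji": 10 * 2_000, "none": 2_000}
retained = {"emoji": 19_945, "none": 1_995}
splits = {"train": 19_940, "dev": 1_000, "test": 1_000}

total_source = sum(source_texts.values())
total_retained = sum(retained.values())
dropped = total_source - total_retained

# The retained utterances and the three splits must describe the same total.
assert total_source == 22_000
assert total_retained == 21_940
assert sum(splits.values()) == total_retained

print(f"dropped {dropped} utterances ({dropped / total_source:.2%} of source texts)")
# prints: dropped 60 utterances (0.27% of source texts)
```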

Input and Output

  • Input: one utterance-level speech waveform
  • Output: one top-1 label from the following set:
  • <none>
  • 😭
  • 😂
  • 😊
  • 😍
  • 🙏
  • 🤔
  • 🔥
  • 💔
  • 👍
  • 🙄
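
Turning the classifier's 11 logits into a top-1 label could look like the sketch below. It assumes label2id.json is a flat JSON object mapping each label string to an integer index, which is not confirmed here; load_id2label and decode_top1 are hypothetical helper names, and the three-class mapping is a toy stand-in for the real file.

```python
import json
import math


def load_id2label(path):
    """Read a label-to-index JSON file and invert it to index-to-label."""
    with open(path, encoding="utf-8") as f:
        label2id = json.load(f)
    return {index: label for label, index in label2id.items()}


def decode_top1(logits, id2label):
    """Return (label, softmax probability) for the highest-scoring class."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # numerically stable softmax
    best = max(range(len(logits)), key=logits.__getitem__)
    return id2label[best], exps[best] / sum(exps)


# Toy three-class mapping for illustration only; the real file has 11 entries
# and its index order is not documented here.
toy_id2label = {0: "<none>", 1: "😭", 2: "😊"}
label, prob = decode_top1([0.1, 2.3, -0.5], toy_id2label)
print(label, round(prob, 3))
```

With the real files, the mapping would come from load_id2label("label2id.json") and the logits from the classifier head applied to one utterance embedding.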

The training records also include source text, but the released classifier itself predicts from speech only. In the original evaluation pipeline, it is used as an in-domain classifier for emoji-conditioned speech rather than as a general-purpose emotion recognizer.

Held-out Performance

On the balanced 1,000-example test split:

  • Accuracy: 0.695
  • Macro-F1: 0.6942
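
For readers who want to reproduce these two metrics on their own predictions, the sketch below uses the standard definitions: accuracy is the fraction of exact matches, and macro-F1 is the unweighted mean of per-class F1. How the reported numbers were actually computed is not stated here, so treat this as an assumption; the toy labels are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with three classes.
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]
print(round(accuracy(y_true, y_pred), 4), round(macro_f1(y_true, y_pred), 4))
# prints: 0.6667 0.6556
```

Because every class counts equally in macro-F1, it stays close to accuracy only when the test split is roughly balanced across classes, as it is described to be here.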

Files

  • best_classifier.pt: trained classifier checkpoint
  • label2id.json: label-to-index mapping
  • summary.json: sanitized training and evaluation summary

Limitations

  • Trained on English utterance-level speech only
  • Trained on a synthetic / curated emoji speech corpus, not broad real-world speech
  • Intended for in-domain controllability evaluation