# Speech-to-Emoji Classifier
This repository releases a lightweight speech-to-emoji classifier used for Emoji-TTS evaluation.
It predicts one label out of 11 classes: 10 emojis plus a `<none>` class.
## Model
- Base speech encoder: `emotion2vec/emotion2vec_base`
- Feature type: utterance-level embedding
- Classifier head: 3-layer MLP
- Input dimension: 768
- Hidden dimension: 512
- Dropout: 0.2
- Labels: 11
The released `best_classifier.pt` contains the trained MLP head and label metadata.
To run inference, you also need the upstream `emotion2vec/emotion2vec_base` model to extract utterance embeddings.
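The head described above can be sketched in PyTorch. Only the dimensions, dropout, and layer count are stated in this card; the activation function, layer ordering, and class name below are assumptions:

```python
import torch
import torch.nn as nn


class EmojiHead(nn.Module):
    """Hypothetical 3-layer MLP head matching the stated sizes
    (768 -> 512 -> 512 -> 11); ReLU and dropout placement are assumptions."""

    def __init__(self, input_dim=768, hidden_dim=512, num_labels=11, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_labels),
        )

    def forward(self, x):
        # Returns unnormalized logits of shape (batch, num_labels).
        return self.net(x)
```

The head consumes one utterance-level embedding per example, so a batch is simply a `(batch, 768)` tensor.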
## Training Data
The model was trained on 21,940 English TTS utterances built from 22,000 source texts:
- 20,000 source texts carry 1 of 10 emoji labels, with 2,000 texts per emoji
- 2,000 source texts are used as the `<none>` class
- Each text is synthesized with Gemini TTS (`flash` or `pro`)
- After merging available TTS outputs and filtering missing audio, 21,940 utterances remain overall
- 19,945 retained utterances are emoji-labeled
- 1,995 retained utterances are `<none>`
Balanced split:
- Train: 19,940
- Dev: 1,000
- Test: 1,000
## Input and Output
- Input: one utterance-level speech waveform
- Output: one top-1 label from the set of 11 classes: the 10 emoji labels plus `<none>`
The training records also include source text, but the released classifier itself predicts from speech only. In the original evaluation pipeline, it is used as an in-domain classifier for emoji-conditioned speech rather than as a general-purpose emotion recognizer.
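Decoding a prediction is a single argmax over the head's logits. A minimal sketch, assuming the embedding comes from the upstream encoder and `id2label` is the inverse of `label2id.json` (the function name is hypothetical):

```python
import torch
import torch.nn as nn


def predict_top1(head: nn.Module, embedding: torch.Tensor, id2label: dict) -> str:
    """Map one 768-dim utterance-level embedding to its top-1 label string.

    `head` is the trained MLP classifier; `id2label` maps integer index
    to label string (the inverse of label2id.json).
    """
    head.eval()
    with torch.no_grad():
        logits = head(embedding.unsqueeze(0))  # (1, num_labels)
    return id2label[int(logits.argmax(dim=-1).item())]
```

Because the classifier is speech-only, nothing from the source text enters this path; the embedding is the entire input.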
## Held-out Performance
On the balanced 1,000-example test split:
- Accuracy: 0.695
- Macro-F1: 0.6942
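Macro-F1 weights all 11 classes equally, so the near-match with accuracy reflects the balanced test split. A dependency-free sketch of the metric (not the project's evaluation code):

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores over `labels`."""
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```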
## Files

- `best_classifier.pt`: trained classifier checkpoint
- `label2id.json`: label-to-index mapping
- `summary.json`: sanitized training and evaluation summary
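A small loading sketch for the label mapping; the exact JSON layout (`{"label": index, ...}`) and the helper name are assumptions:

```python
import json


def load_label_maps(path="label2id.json"):
    """Load the released label-to-index mapping and build the inverse
    map used to decode predicted indices back to label strings."""
    with open(path) as f:
        label2id = json.load(f)
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label
```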
## Limitations
- Trained on English utterance-level speech only
- Trained on a synthetic / curated emoji speech corpus, not broad real-world speech
- Intended for in-domain controllability evaluation