# Speech-to-Emoji Classifier
This repository releases a lightweight speech-to-emoji classifier used for Emoji-TTS evaluation.
It predicts one label out of 11 classes: 10 emojis plus a `<none>` class.
## Model
- Base speech encoder: `emotion2vec/emotion2vec_base`
- Feature type: utterance-level embedding
- Classifier head: 3-layer MLP
- Input dimension: 768
- Hidden dimension: 512
- Dropout: 0.2
- Labels: 11
The released `best_classifier.pt` contains the trained MLP head and label metadata.
To run inference, you also need the upstream `emotion2vec/emotion2vec_base` model to extract utterance embeddings.
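The head described above can be sketched in PyTorch. Only the dimensions, dropout, and layer count are stated in this card; the activation function, layer ordering, and class name below are assumptions:

```python
import torch
import torch.nn as nn


class EmojiHead(nn.Module):
    """Hypothetical 3-layer MLP head matching the stated sizes
    (768 -> 512 -> 512 -> 11); ReLU and dropout placement are assumptions."""

    def __init__(self, input_dim=768, hidden_dim=512, num_labels=11, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_labels),
        )

    def forward(self, x):
        # Returns unnormalized logits of shape (batch, num_labels).
        return self.net(x)
```

The head consumes one utterance-level embedding per example, so a batch is simply a `(batch, 768)` tensor.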
## Training Data
The model was trained on 21,940 English TTS utterances built from 22,000 source texts:
- 20,000 source texts carry 1 of 10 emoji labels, with 2,000 texts per emoji
- 2,000 source texts are used as the `<none>` class
- Each text is synthesized with Gemini TTS (`flash` or `pro`)
- After merging available TTS outputs and filtering missing audio, 21,940 utterances remain overall
- 19,945 retained utterances are emoji-labeled
- 1,995 retained utterances are `<none>`
Balanced split:
- Train: 19,940
- Dev: 1,000
- Test: 1,000
## Input and Output
- Input: one utterance-level speech waveform
- Output: one top-1 label from the set of 11 classes: the 10 emoji labels plus `<none>`
The training records also include source text, but the released classifier itself predicts from speech only. In the original evaluation pipeline, it is used as an in-domain classifier for emoji-conditioned speech rather than as a general-purpose emotion recognizer.
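Decoding a prediction is a single argmax over the head's logits. A minimal sketch, assuming the embedding comes from the upstream encoder and `id2label` is the inverse of `label2id.json` (the function name is hypothetical):

```python
import torch
import torch.nn as nn


def predict_top1(head: nn.Module, embedding: torch.Tensor, id2label: dict) -> str:
    """Map one 768-dim utterance-level embedding to its top-1 label string.

    `head` is the trained MLP classifier; `id2label` maps integer index
    to label string (the inverse of label2id.json).
    """
    head.eval()
    with torch.no_grad():
        logits = head(embedding.unsqueeze(0))  # (1, num_labels)
    return id2label[int(logits.argmax(dim=-1).item())]
```

Because the classifier is speech-only, nothing from the source text enters this path; the embedding is the entire input.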
## Held-out Performance
On the balanced 1,000-example test split:
- Accuracy: 0.695
- Macro-F1: 0.6942
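Macro-F1 weights all 11 classes equally, so the near-match with accuracy reflects the balanced test split. A dependency-free sketch of the metric (not the project's evaluation code):

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores over `labels`."""
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```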
## Files

- `best_classifier.pt`: trained classifier checkpoint
- `label2id.json`: label-to-index mapping
- `summary.json`: sanitized training and evaluation summary
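A small loading sketch for the label mapping; the exact JSON layout (`{"label": index, ...}`) and the helper name are assumptions:

```python
import json


def load_label_maps(path="label2id.json"):
    """Load the released label-to-index mapping and build the inverse
    map used to decode predicted indices back to label strings."""
    with open(path) as f:
        label2id = json.load(f)
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label
```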
## Limitations
- Trained on English utterance-level speech only
- Trained on a synthetic / curated emoji speech corpus, not broad real-world speech
- Intended for in-domain controllability evaluation