---
language: en
tags:
- audio
- emotion-recognition
- speech
- pytorch
- cnn
- ravdess
license: mit
datasets:
- ravdess
metrics:
- accuracy
- f1
model-index:
- name: speech-emotion-recognition-v2
  results:
  - task:
      type: audio-classification
      name: Speech Emotion Recognition
    dataset:
      name: RAVDESS
      type: ravdess
    metrics:
    - type: accuracy
      value: 75.0
      name: Validation Accuracy
    - type: accuracy
      value: 66.2
      name: Test Accuracy
---

# Speech Emotion Recognition (Enhanced Model V2)

## Model Description

This model is a deep CNN-based classifier for detecting emotions from speech audio. It achieves **75% validation accuracy** and **66.2% test accuracy** on the RAVDESS dataset through enhanced feature extraction, residual connections, and attention mechanisms.

### Model Architecture

- **Type**: Convolutional Neural Network with Residual Blocks
- **Parameters**: 11,873,480
- **Input**: 196-dimensional audio features × 128 time steps
- **Output**: 8 emotion classes

**Architecture Details:**
- 4 Residual Layers (2 blocks each)
- Channel Attention Mechanisms
- Dual Global Pooling (Average + Max)
- Fully Connected Layers: 1024 → 512 → 256 → 8
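The actual `ImprovedEmotionCNN` lives in `models/emotion_cnn_v2.py`; as a rough sketch of the building block described above, assuming the channel attention is a squeeze-and-excitation-style gate (the repo's module may differ in detail):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel gate (assumed form, not the repo's exact module)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        # Per-channel weights in [0, 1], broadcast over the feature map.
        w = self.fc(x).view(x.size(0), -1, 1, 1)
        return x * w

class ResidualBlock(nn.Module):
    """Conv-BN-ReLU x2 with channel attention and an identity/projection skip."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.attn = ChannelAttention(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # 1x1 projection when the shape changes, identity otherwise.
        self.skip = (nn.Sequential() if stride == 1 and in_ch == out_ch else
                     nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                                   nn.BatchNorm2d(out_ch)))

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.attn(self.bn2(self.conv2(out)))
        return self.relu(out + self.skip(x))
```

The dual global pooling mentioned above would then concatenate `AdaptiveAvgPool2d(1)` and `AdaptiveMaxPool2d(1)` outputs before the 1024 → 512 → 256 → 8 classifier head.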

### Features (196 dimensions)

- Mel-spectrograms: 128 bands
- MFCCs: 13 coefficients
- Delta MFCCs: 13 (temporal dynamics)
- Delta-Delta MFCCs: 13 (acceleration)
- Chromagram: 12 (pitch content)
- Spectral Contrast: 7 (texture)
- Tonnetz: 6 (harmonic content)
- Additional: 4 (ZCR, centroid, rolloff, bandwidth)

## Intended Use

### Primary Use Cases

- Emotion detection from speech audio
- Affective computing research
- Human-computer interaction
- Mental health monitoring
- Call center analytics

### Out-of-Scope Use

- Real-time streaming audio (model requires 3-second clips)
- Non-speech audio (music, environmental sounds)
- Languages other than English
- Clinical diagnosis without professional oversight

## Training Data

**RAVDESS** (Ryerson Audio-Visual Database of Emotional Speech and Song)
- 1,440 speech files
- 8 emotion classes: neutral, calm, happy, sad, angry, fearful, disgust, surprised
- 24 professional actors (12 male, 12 female)
- Controlled recording environment
- Split: 70% train, 15% validation, 15% test

## Performance

### Overall Metrics

| Metric | Value |
|--------|-------|
| Validation Accuracy | 75.00% |
| Test Accuracy | 66.20% |
| Macro F1-Score | 0.660 |
| Weighted F1-Score | 0.658 |

### Per-Class Performance (Test Set)

| Emotion | Accuracy | Precision | Recall | F1-Score |
|---------|----------|-----------|--------|----------|
| Neutral | 71.43% | 0.667 | 0.714 | 0.690 |
| Calm | 85.71% | 0.686 | 0.857 | 0.762 |
| Happy | 58.62% | 0.531 | 0.586 | 0.557 |
| Sad | 51.72% | 0.500 | 0.517 | 0.508 |
| Angry | 68.97% | 0.769 | 0.690 | 0.727 |
| Fearful | 41.38% | 0.706 | 0.414 | 0.522 |
| Disgust | 75.86% | 0.688 | 0.759 | 0.721 |
| Surprised | 79.31% | 0.793 | 0.793 | 0.793 |

### Comparison with Baseline

| Metric | Baseline | Enhanced | Improvement |
|--------|----------|----------|-------------|
| Validation Acc | 38.89% | 75.00% | +36.11 pts |
| Test Acc | 39.81% | 66.20% | +26.39 pts |
| Parameters | 536K | 11.8M | ~22x |

## Usage

### Installation

```bash
pip install torch torchaudio librosa numpy
```

### Quick Start

```python
import torch
from models.emotion_cnn_v2 import ImprovedEmotionCNN
from data.prepare_data import extract_features

# Load model
model = ImprovedEmotionCNN(num_classes=8)
checkpoint = torch.load('best_model_v2.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Load and process audio
features = extract_features('path/to/audio.wav')
features_tensor = torch.FloatTensor(features).unsqueeze(0).unsqueeze(0)

# Predict
with torch.no_grad():
    output = model(features_tensor)
    probs = torch.softmax(output, dim=1)
    predicted_idx = output.argmax(1).item()

emotions = ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']
print(f"Predicted: {emotions[predicted_idx]} ({probs[0][predicted_idx]:.2%})")
```

## Limitations

### Known Issues

1. **Fearful Emotion**: Lowest per-class accuracy (41.38%); often confused with other negative emotions
2. **Test-Validation Gap**: 75% validation vs. 66.2% test accuracy suggests some overfitting
3. **Dataset Bias**: Trained on professional actors in controlled environment
4. **Language**: English only
5. **Audio Quality**: Requires clear speech without background noise

### Ethical Considerations

- **Privacy**: Emotion detection from voice raises privacy concerns
- **Bias**: May not generalize well across different demographics, accents, or cultures
- **Misuse**: Should not be used for surveillance or manipulation
- **Context**: Emotions are complex and context-dependent; model provides probabilities, not certainties

## Training Procedure

### Hyperparameters

```python
{
    'batch_size': 24,
    'learning_rate': 0.001,
    'epochs': 150,
    'optimizer': 'AdamW',
    'weight_decay': 1e-4,
    'loss': 'CrossEntropyLoss + Label Smoothing (0.1)',
    'lr_scheduler': 'ReduceLROnPlateau (patience=8, factor=0.5)',
    'early_stopping': 'patience=20',
    'mixed_precision': 'FP16',
    'gradient_clipping': 'max_norm=1.0'
}
```
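The dictionary above maps onto standard PyTorch objects; a minimal one-step sketch (with a stand-in linear model so the snippet runs on its own — substitute `ImprovedEmotionCNN` in practice; the FP16/early-stopping pieces are omitted for brevity):

```python
import torch
import torch.nn as nn

# Stand-in model so the snippet is self-contained.
model = nn.Sequential(nn.Flatten(), nn.Linear(196 * 128, 8))

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)        # label smoothing 0.1
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.5, patience=8)                      # halve LR on plateau

# One illustrative optimization step on random data (batch_size=24).
x = torch.randn(24, 1, 196, 128)
y = torch.randint(0, 8, (24,))
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip grads
optimizer.step()
scheduler.step(loss.item())  # once per epoch, pass the monitored validation loss
```

In the full loop, `torch.cuda.amp.autocast` and `GradScaler` would wrap the forward/backward passes to provide the FP16 mixed precision listed above.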

### Data Augmentation

- SpecAugment (time and frequency masking)
- Gaussian noise injection
- Time shifting
- Augmentation probability: 60%
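The augmentations above can be sketched directly on the (196, 128) feature matrix; the mask sizes and noise scale below are illustrative assumptions, not values from the repo:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(feats, p=0.6, max_f=12, max_t=20, noise_std=0.01):
    """Apply each augmentation independently with probability p=0.6."""
    feats = feats.copy()
    if rng.random() < p:  # SpecAugment frequency masking
        f = rng.integers(0, max_f)
        f0 = rng.integers(0, feats.shape[0] - f + 1)
        feats[f0:f0 + f, :] = 0.0
    if rng.random() < p:  # SpecAugment time masking
        t = rng.integers(0, max_t)
        t0 = rng.integers(0, feats.shape[1] - t + 1)
        feats[:, t0:t0 + t] = 0.0
    if rng.random() < p:  # time shifting (roll along the frame axis)
        feats = np.roll(feats, rng.integers(-max_t, max_t + 1), axis=1)
    if rng.random() < p:  # Gaussian noise injection
        feats = feats + rng.normal(0.0, noise_std, feats.shape)
    return feats
```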

### Hardware

- GPU: NVIDIA RTX 5060 Ti
- Training Time: ~2.5 hours (150 epochs)
- CUDA: 13.0
- PyTorch: 2.0+

## Citation

If you use this model, please cite:

```bibtex
@misc{speech-emotion-recognition-v2,
  title={Speech Emotion Recognition with Enhanced CNN},
  author={Your Name},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/yourusername/speech-emotion-recognition}}
}
```

### RAVDESS Dataset Citation

```bibtex
@article{livingstone2018ravdess,
  title={The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English},
  author={Livingstone, Steven R and Russo, Frank A},
  journal={PLoS ONE},
  volume={13},
  number={5},
  pages={e0196391},
  year={2018},
  publisher={Public Library of Science}
}
```

## License

MIT License. See the LICENSE file for details.

## Contact

For questions or issues, please open an issue on the [GitHub repository](https://github.com/yourusername/speech-emotion-recognition).

## Acknowledgments

- RAVDESS dataset creators
- PyTorch team
- librosa developers
- Hugging Face community