---
language: en
tags:
- audio
- emotion-recognition
- speech
- pytorch
- cnn
- ravdess
license: mit
datasets:
- ravdess
metrics:
- accuracy
- f1
model-index:
- name: speech-emotion-recognition-v2
results:
- task:
type: audio-classification
name: Speech Emotion Recognition
dataset:
name: RAVDESS
type: ravdess
metrics:
- type: accuracy
value: 75.0
name: Validation Accuracy
- type: accuracy
value: 66.2
name: Test Accuracy
---
# Speech Emotion Recognition (Enhanced Model V2)
## Model Description
This model is a deep CNN-based classifier for detecting emotions from speech audio. It achieves **75% validation accuracy** and **66.2% test accuracy** on the RAVDESS dataset through enhanced feature extraction, residual connections, and attention mechanisms.
### Model Architecture
- **Type**: Convolutional Neural Network with Residual Blocks
- **Parameters**: 11,873,480
- **Input**: 196-dimensional audio features × 128 time steps
- **Output**: 8 emotion classes
**Architecture Details:**
- 4 Residual Layers (2 blocks each)
- Channel Attention Mechanisms
- Dual Global Pooling (Average + Max)
- Fully Connected Layers: 1024 → 512 → 256 → 8
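As a rough illustration of these building blocks, one residual block with SE-style channel attention plus the dual pooling step might look like the sketch below. The class names and the exact attention variant are assumptions; the real definitions live in `models/emotion_cnn_v2.py`.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (illustrative)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                    # squeeze: global average pool
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                               # per-channel weight in [0, 1]
        )

    def forward(self, x):
        return x * self.gate(x)                         # reweight channels

class ResidualBlock(nn.Module):
    """Two 3x3 convs with a skip connection and channel attention."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.attn = ChannelAttention(out_ch)
        self.skip = nn.Identity()
        if stride != 1 or in_ch != out_ch:              # match shapes on the skip path
            self.skip = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.attn(self.bn2(self.conv2(out)))
        return torch.relu(out + self.skip(x))

def dual_global_pool(x):
    """Concatenate global average- and max-pooled features along channels."""
    return torch.cat([x.mean(dim=(2, 3)), x.amax(dim=(2, 3))], dim=1)
```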
### Features (196 dimensions)
- Mel-spectrograms: 128 bands
- MFCCs: 13 coefficients
- Delta MFCCs: 13 (temporal dynamics)
- Delta-Delta MFCCs: 13 (acceleration)
- Chromagram: 12 (pitch content)
- Spectral Contrast: 7 (texture)
- Tonnetz: 6 (harmonic content)
- Additional: 4 (ZCR, centroid, rolloff, bandwidth)
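The row counts above sum to 196 (128 + 13 + 13 + 13 + 12 + 7 + 6 + 4). A plausible `extract_features` along these lines is sketched below, assuming librosa defaults; the actual implementation in `data/prepare_data.py` may differ in frame parameters, feature order, normalization, and padding.

```python
import numpy as np
import librosa

def extract_features(path, sr=22050, n_fft=2048, hop=512, n_frames=128):
    """Stack the 196 per-frame features described above (illustrative sketch)."""
    y, sr = librosa.load(path, sr=sr, duration=3.0)
    mel = librosa.power_to_db(librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=128, n_fft=n_fft, hop_length=hop))
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    d1 = librosa.feature.delta(mfcc)                  # temporal dynamics
    d2 = librosa.feature.delta(mfcc, order=2)         # acceleration
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr, hop_length=hop)
    tonnetz = librosa.feature.tonnetz(y=y, sr=sr)     # 6 harmonic dimensions
    zcr = librosa.feature.zero_crossing_rate(y, hop_length=hop)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr, hop_length=hop)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, hop_length=hop)
    bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr, hop_length=hop)

    feats = [mel, mfcc, d1, d2, chroma, contrast, tonnetz,
             zcr, centroid, rolloff, bandwidth]
    # Feature extractors may disagree on frame count; trim to a common length
    t = min(min(f.shape[1] for f in feats), n_frames)
    x = np.vstack([f[:, :t] for f in feats])          # (196, t)
    if t < n_frames:                                  # zero-pad time axis to 128
        x = np.pad(x, ((0, 0), (0, n_frames - t)))
    return x.astype(np.float32)
```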
## Intended Use
### Primary Use Cases
- Emotion detection from speech audio
- Affective computing research
- Human-computer interaction
- Mental health monitoring
- Call center analytics
### Out-of-Scope Use
- Real-time streaming audio (model requires 3-second clips)
- Non-speech audio (music, environmental sounds)
- Languages other than English
- Clinical diagnosis without professional oversight
## Training Data
**RAVDESS** (Ryerson Audio-Visual Database of Emotional Speech and Song)
- 1,440 speech files
- 8 emotion classes: neutral, calm, happy, sad, angry, fearful, disgust, surprised
- 24 professional actors (12 male, 12 female)
- Controlled recording environment
- Split: 70% train, 15% validation, 15% test
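The card does not state whether the split is stratified by emotion or held out by speaker. A minimal label-stratified 70/15/15 split with scikit-learn (an assumption, not necessarily the repo's procedure) would be:

```python
from sklearn.model_selection import train_test_split

# files: list of wav paths; labels: parallel list of emotion ids (0-7)
train_f, rest_f, train_y, rest_y = train_test_split(
    files, labels, test_size=0.30, stratify=labels, random_state=42)
val_f, test_f, val_y, test_y = train_test_split(
    rest_f, rest_y, test_size=0.50, stratify=rest_y, random_state=42)
```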
## Performance
### Overall Metrics
| Metric | Value |
|--------|-------|
| Validation Accuracy | 75.00% |
| Test Accuracy | 66.20% |
| Macro F1-Score | 0.660 |
| Weighted F1-Score | 0.658 |
### Per-Class Performance (Test Set)
| Emotion | Precision | Recall | F1-Score |
|-----------|-----------|--------|----------|
| Neutral | 0.667 | 0.714 | 0.690 |
| Calm | 0.686 | 0.857 | 0.762 |
| Happy | 0.531 | 0.586 | 0.557 |
| Sad | 0.500 | 0.517 | 0.508 |
| Angry | 0.769 | 0.690 | 0.727 |
| Fearful | 0.706 | 0.414 | 0.522 |
| Disgust | 0.688 | 0.759 | 0.721 |
| Surprised | 0.793 | 0.793 | 0.793 |
### Comparison with Baseline
| Metric | Baseline | Enhanced | Improvement |
|--------|----------|----------|-------------|
| Validation Accuracy | 38.89% | 75.00% | +36.11 pp |
| Test Accuracy | 39.81% | 66.20% | +26.39 pp |
| Parameters | 536K | 11.8M | 22× |
## Usage
### Installation
```bash
pip install torch torchaudio librosa numpy
```
### Quick Start
```python
import torch
import librosa
import numpy as np
from models.emotion_cnn_v2 import ImprovedEmotionCNN
from data.prepare_data import extract_features
# Load the model and its trained weights
model = ImprovedEmotionCNN(num_classes=8)
checkpoint = torch.load('best_model_v2.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Extract the 196 x 128 feature matrix and add batch/channel dimensions
features = extract_features('path/to/audio.wav')
features_tensor = torch.FloatTensor(features).unsqueeze(0).unsqueeze(0)  # (1, 1, 196, 128)

# Predict
with torch.no_grad():
    output = model(features_tensor)
    probs = torch.softmax(output, dim=1)
    predicted_idx = output.argmax(1).item()

emotions = ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']
print(f"Predicted: {emotions[predicted_idx]} ({probs[0][predicted_idx]:.2%})")
```
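The model was trained on 3-second clips (see Out-of-Scope Use above), so longer recordings should be trimmed or windowed into 3-second segments before feature extraction.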
## Limitations
### Known Issues
1. **Fearful Emotion**: Lowest per-class recall (0.414); often confused with other negative emotions
2. **Test-Validation Gap**: 75% validation vs 66.2% test suggests some overfitting
3. **Dataset Bias**: Trained on professional actors in controlled environment
4. **Language**: English only
5. **Audio Quality**: Requires clear speech without background noise
### Ethical Considerations
- **Privacy**: Emotion detection from voice raises privacy concerns
- **Bias**: May not generalize well across different demographics, accents, or cultures
- **Misuse**: Should not be used for surveillance or manipulation
- **Context**: Emotions are complex and context-dependent; model provides probabilities, not certainties
## Training Procedure
### Hyperparameters
```python
{
    'batch_size': 24,
    'learning_rate': 0.001,
    'epochs': 150,
    'optimizer': 'AdamW',
    'weight_decay': 1e-4,
    'loss': 'CrossEntropyLoss + Label Smoothing (0.1)',
    'lr_scheduler': 'ReduceLROnPlateau (patience=8, factor=0.5)',
    'early_stopping': 'patience=20',
    'mixed_precision': 'FP16',
    'gradient_clipping': 'max_norm=1.0'
}
```
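These settings translate into roughly the following training-loop wiring. This is a sketch under the hyperparameters above, not the repo's actual script; `train_loader` and the validation hook are assumed.

```python
import torch
from torch import nn
from models.emotion_cnn_v2 import ImprovedEmotionCNN

model = ImprovedEmotionCNN(num_classes=8).cuda()
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.5, patience=8)
scaler = torch.cuda.amp.GradScaler()                  # FP16 mixed precision

for features, targets in train_loader:                # train_loader is assumed
    features, targets = features.cuda(), targets.cuda()
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = criterion(model(features), targets)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                        # unscale before clipping in fp32
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()

# After each validation epoch: scheduler.step(val_accuracy)
```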
### Data Augmentation
- SpecAugment (time and frequency masking)
- Gaussian noise injection
- Time shifting
- Augmentation probability: 60%
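A minimal version of this pipeline, applied to the 196 × 128 feature matrix, could look like the sketch below. The mask widths, shift range, and noise scale are placeholders; the card only states the 60% application probability.

```python
import numpy as np

def augment(x, p=0.6, rng=np.random.default_rng()):
    """Apply each augmentation independently with probability p (illustrative)."""
    x = x.copy()
    if rng.random() < p:                              # time shift
        x = np.roll(x, rng.integers(-20, 20), axis=1)
    if rng.random() < p:                              # Gaussian noise injection
        x = x + rng.normal(0, 0.01, x.shape)
    if rng.random() < p:                              # SpecAugment frequency mask
        f0 = rng.integers(0, x.shape[0] - 16)
        x[f0:f0 + rng.integers(1, 16)] = 0.0
    if rng.random() < p:                              # SpecAugment time mask
        t0 = rng.integers(0, x.shape[1] - 16)
        x[:, t0:t0 + rng.integers(1, 16)] = 0.0
    return x
```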
### Hardware
- GPU: NVIDIA RTX 5060 Ti
- Training Time: ~2.5 hours (150 epochs)
- CUDA: 13.0
- PyTorch: 2.0+
## Citation
If you use this model, please cite:
```bibtex
@misc{speech-emotion-recognition-v2,
  title        = {Speech Emotion Recognition with Enhanced CNN},
  author       = {Your Name},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/yourusername/speech-emotion-recognition}}
}
```
### RAVDESS Dataset Citation
```bibtex
@article{livingstone2018ravdess,
  title     = {The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English},
  author    = {Livingstone, Steven R. and Russo, Frank A.},
  journal   = {PLoS ONE},
  volume    = {13},
  number    = {5},
  pages     = {e0196391},
  year      = {2018},
  publisher = {Public Library of Science}
}
```
## License
MIT License - See LICENSE file for details
## Contact
For questions or issues, please open an issue on the [GitHub repository](https://github.com/yourusername/speech-emotion-recognition).
## Acknowledgments
- RAVDESS dataset creators
- PyTorch team
- librosa developers
- Hugging Face community