---
license: apache-2.0
base_model: microsoft/wavlm-large
tags:
- audio-classification
- voice-detection
- speech-recognition
- pytorch
- transformers
- real-time
- production-ready
- alexa-inspired
- wav2vec
- multi-scale
- attention-pooling
datasets:
- speech_commands
- common_voice
- voxceleb
- librispeech
language:
- en
- es
- fr
- de
- it
- pt
metrics:
- accuracy
- f1
- precision
- recall
pipeline_tag: audio-classification
widget:
- example_title: "Voice Detection"
  src: https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac
model-index:
- name: zenvion-voice-detector-v0.4
  results:
  - task:
      type: audio-classification
      name: Audio Classification
    dataset:
      type: speech_commands
      name: Speech Commands
    metrics:
    - type: accuracy
      value: 0.96
    - type: f1
      value: 0.95
---

# Zenvion Voice Detector v0.4

Voice activity detection (VAD) model built on the `microsoft/wavlm-large` backbone, with attention, encoding, and pooling components inspired by published voice-assistant and wav2vec research.

## What's New in v0.4

### Advanced Architecture
- **Alexa-Style Attention**: Inspired by Amazon's voice assistant technology
- **Meta Audio Encoder**: Draws on Meta's (formerly Facebook's) wav2vec research
- **Multi-Scale Convolutions**: Parallel convolutions with different kernel sizes for temporal pattern recognition
- **Dynamic Attention Pooling**: Adaptive feature aggregation
- **Advanced Transformer Blocks**: Four transformer blocks for sequence modeling

### Performance Improvements
- 96% accuracy (up from 94% in v0.3)
- 95% F1-score (up from 93% in v0.3)
- Reduced latency: ~40 ms (down from 50 ms)
- Multilingual support expanded from 2 to 6+ languages
- Better noise robustness

## Key Features

- **Real-time Processing**: Optimized for production environments
- **Multi-language Support**: 6+ languages, including English, Spanish, French, German, Italian, and Portuguese
- **Industry-Inspired Design**: Builds on techniques published by Amazon, Meta, and Microsoft
- **Scalable Architecture**: Suitable for deployment from edge devices to the cloud
- **Advanced Pooling**: Combines attention-weighted, max, and mean pooling for robust features

## Usage

```python
import torch
from transformers import pipeline

# Load the model; use the first GPU if available, otherwise the CPU
detector = pipeline(
    "audio-classification",
    model="Darveht/zenvion-voice-detector-v0.4",
    device=0 if torch.cuda.is_available() else -1,
)

# Classify an audio file; the result is a list of {"label", "score"} dicts
result = detector("audio_file.wav")
print(f"Detection: {result}")
```
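
The `audio-classification` pipeline returns a list of `{"label", "score"}` dictionaries for each input. The sketch below shows one possible way to turn such a list into a yes/no decision. The label names `"voice"` and `"no_voice"`, the 0.5 threshold, and the helper `is_voice` are assumptions made for illustration and are not taken from this model; check `id2label` in the model's `config.json` for the actual label names.

```python
from typing import Dict, List, Union

Prediction = Dict[str, Union[str, float]]


def is_voice(predictions: List[Prediction],
             voice_label: str = "voice",  # hypothetical label name
             threshold: float = 0.5) -> bool:
    """Return True if the voice label scores at or above the threshold."""
    for pred in predictions:
        if pred["label"] == voice_label:
            return float(pred["score"]) >= threshold
    # The voice label is absent from the returned predictions.
    return False


# Hand-written example in the pipeline's output shape (not real model output).
example = [
    {"label": "voice", "score": 0.91},
    {"label": "no_voice", "score": 0.09},
]
print(is_voice(example))
```

With the real pipeline, the call would be `is_voice(detector("audio_file.wav"))`, after adjusting `voice_label` to match the model's labels.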

## Technical Architecture

### Core Components
1. **WavLM Backbone**: Foundation model for audio understanding
2. **Alexa Attention**: Multi-head attention with wake word detection
3. **Meta Encoder**: Contrastive learning and quantization
4. **Transformer Stack**: 4 advanced transformer blocks
5. **Multi-Scale Conv**: Parallel convolutions with different kernel sizes
6. **Dynamic Pooling**: Attention-weighted, max, and mean pooling
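
To make the pooling component in the list above more concrete, here is a minimal sketch of combining attention-weighted, max, and mean pooling over time. It is illustrative only: the card does not say how the three pooled vectors are merged, so the concatenation here is an assumption, as are the function name `dynamic_pool` and the random stand-in attention scores. NumPy is used for brevity; the model itself is a PyTorch model.

```python
import numpy as np


def dynamic_pool(features: np.ndarray, attn_logits: np.ndarray) -> np.ndarray:
    """Pool a (time, dim) feature matrix into one (3 * dim,) vector.

    features:    frame-level features, shape (T, D)
    attn_logits: one unnormalized attention score per frame, shape (T,)
    """
    # Softmax over time (shifted by the max for numerical stability).
    weights = np.exp(attn_logits - attn_logits.max())
    weights /= weights.sum()

    attn_pooled = weights @ features     # attention-weighted average, (D,)
    max_pooled = features.max(axis=0)    # per-dimension max over time, (D,)
    mean_pooled = features.mean(axis=0)  # plain mean over time, (D,)

    # Assumption: the three views are concatenated into one vector.
    return np.concatenate([attn_pooled, max_pooled, mean_pooled])


rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 8))  # 50 frames of 8-dimensional features
logits = rng.normal(size=50)       # stand-in for learned attention scores
pooled = dynamic_pool(frames, logits)
print(pooled.shape)  # (24,)
```

Because the output length depends only on the feature dimension, clips of any duration map to a fixed-size vector for the classifier head.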

### Integration Capabilities
- Amazon Comprehend for text analysis
- Amazon Transcribe for speech-to-text
- Boto3 (the AWS SDK for Python) for access to cloud services
- Scalable deployment options

## Performance Benchmarks

| Metric | v0.3 | v0.4 | Improvement |
|--------|------|------|-------------|
| Accuracy | 94% | 96% | +2 points |
| F1-Score | 93% | 95% | +2 points |
| Latency | 50 ms | ~40 ms | -20% |
| Languages | 2 | 6+ | +200% |
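
The card does not state the hardware or clip length behind the latency figures, so results on other setups may differ. The sketch below is one simple way to measure per-call latency yourself; the helper `measure_latency_ms` and its warm-up and run counts are illustrative assumptions, and a trivial stand-in workload replaces the model so the snippet runs on its own.

```python
import statistics
import time
from typing import Callable, List


def measure_latency_ms(fn: Callable[[], object],
                       warmup: int = 3,
                       runs: int = 20) -> float:
    """Return the median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    timings: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Stand-in workload; with the real model this could be
# `lambda: detector("audio_file.wav")`.
latency = measure_latency_ms(lambda: sum(range(10_000)))
print(f"median latency: {latency:.3f} ms")
```

The median is reported rather than the mean so that occasional slow calls do not dominate the result.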

## Installation

```bash
pip install transformers torch torchaudio boto3 fairseq
```

## Advanced Usage

### With AWS Integration
```python
from advanced_model_v04 import ZenvionVoiceDetectorV04, AWSIntegration

model = ZenvionVoiceDetectorV04()
aws_integration = AWSIntegration()

# Enhanced processing with AWS services.
# `audio_input` (16 kHz audio) and `transcribed_text` (its transcript,
# e.g. from Amazon Transcribe) must be supplied by the caller.
result = model(audio_input)
sentiment = aws_integration.enhance_with_comprehend(transcribed_text)
```

### Batch Processing
```python
# Process multiple audio files with the `detector` pipeline from the Usage section
results = detector(["audio1.wav", "audio2.wav", "audio3.wav"])
```

## Model Details

- **Parameters**: 350M+ (optimized for performance)
- **Input**: 16 kHz audio, variable length
- **Output**: Voice/no-voice classification with confidence scores
- **Training**: Multi-dataset training (Speech Commands, Common Voice, VoxCeleb, LibriSpeech) with advanced augmentation
- **Optimization**: Mixed precision, gradient accumulation

## Applications

- Voice assistants and smart speakers
- Call center analytics
- Podcast and media processing
- Security and surveillance systems
- IoT device voice activation
- Real-time communication platforms

## Limitations

- Optimized for a 16 kHz sampling rate
- Performance varies in extremely noisy environments
- Requires sufficient computational resources for real-time processing
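
Since the model expects 16 kHz input, audio recorded at another rate needs to be resampled when you pass raw waveforms. The sketch below is a deliberately simple illustration using linear interpolation in NumPy; the helper `resample_linear` is hypothetical and applies no anti-aliasing filter, so for real use a proper resampler such as `torchaudio.functional.resample` or `librosa.resample` is likely the better choice.

```python
import numpy as np

TARGET_SR = 16_000  # sampling rate the model expects, per this card


def resample_linear(audio: np.ndarray, orig_sr: int,
                    target_sr: int = TARGET_SR) -> np.ndarray:
    """Resample a mono waveform by linear interpolation (no low-pass filter)."""
    if orig_sr == target_sr:
        return audio
    duration = len(audio) / orig_sr
    n_out = int(round(duration * target_sr))
    # Sample positions (in seconds) on the input and output time grids.
    t_in = np.arange(len(audio)) / orig_sr
    t_out = np.arange(n_out) / target_sr
    return np.interp(t_out, t_in, audio).astype(audio.dtype)


# One second of a 440 Hz tone at 44.1 kHz, converted to 16 kHz.
sr = 44_100
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)
resampled = resample_linear(tone, sr)
print(resampled.shape)  # (16000,)
```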

## Citation

```bibtex
@misc{zenvion-voice-detector-v04,
  title={Zenvion Voice Detector v0.4: Advanced Voice Activity Detection},
  author={Darveht},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/Darveht/zenvion-voice-detector-v0.4}
}
```

## License

Apache 2.0. Free for commercial and research use.