# Dolphin: Efficient Audio-Visual Speech Separation

## Model Overview
Dolphin is an efficient audio-visual speech separation model that extracts target speech from noisy environments by combining acoustic and visual (lip movement) cues. It achieves state-of-the-art performance while being 6× faster and using 50% fewer parameters than previous methods.
Links: Paper | Code | Demo | Project Page
## Key Features
- Balanced Quality & Efficiency: SOTA separation quality without iterative refinement
- DP-LipCoder: Lightweight video encoder with discrete, audio-aligned semantic tokens
- Global-Local Attention: Multi-scale attention for long-range context and fine-grained details
- Edge-Friendly: >50% parameter reduction, >2.4× lower MACs, >6× faster inference
## Performance

VoxCeleb2 Benchmark:
| Metric | Value |
|---|---|
| SI-SNRi | 16.1 dB |
| SDRi | 16.3 dB |
| PESQ | 3.45 |
| ESTOI | 0.93 |
| Parameters | 51.3M (vs 112M in IIANet) |
| MACs | 417G (vs 1009G in IIANet) |
| Inference Speed | 0.015s/4s-clip (vs 0.100s in IIANet) |
## Quick Start

### Installation

```bash
pip install torch torchvision torchaudio
pip install huggingface_hub
```
### Inference Example
```python
import torch
import yaml
from huggingface_hub import hf_hub_download

# Download model weights and config
config_path = hf_hub_download(repo_id="JusperLee/Dolphin", filename="conf.yml")
model_path = hf_hub_download(repo_id="JusperLee/Dolphin", filename="best_model.pth")

# Load model (you need to import the Dolphin class from the repo)
with open(config_path) as f:
    config = yaml.safe_load(f)

model = Dolphin(**config['model'])
model.load_state_dict(torch.load(model_path, map_location='cpu'))
model.eval()

# Prepare inputs
# audio: [batch, samples] - 16 kHz audio
# video: [batch, frames, 1, height, width] - grayscale lip frames
audio_mixture = torch.randn(1, 64000)            # 4 seconds at 16 kHz
video_frames = torch.randn(1, 100, 1, 88, 88)    # 4 s at 25 fps, 88x88 resolution

# Separate speech
with torch.no_grad():
    separated_audio = model(audio_mixture, video_frames)
```
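To write the separated track to disk, `torchaudio` works directly on the output tensor. The snippet below assumes `separated_audio` comes back as a `[batch, samples]` waveform at 16 kHz, as listed in the I/O specifications further down.

```python
import torchaudio

# Assumes separated_audio is a [batch, samples] tensor at 16 kHz
# (see Input/Output Specifications); a 1-element batch saves as a mono file.
torchaudio.save("separated.wav", separated_audio.detach().cpu(), sample_rate=16000)
```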
### Complete Pipeline with Video Input

For end-to-end video processing with face detection and tracking, see our inference script:

```bash
git clone https://github.com/JusperLee/Dolphin.git
cd Dolphin
python inference.py \
    --input video.mp4 \
    --output ./output \
    --speakers 2 \
    --config checkpoints/vox2/conf.yml
```
## Model Architecture

### Components

#### DP-LipCoder (Video Encoder)
- Dual-path architecture: visual compression + semantic encoding
- Vector quantization for discrete lip semantic tokens (see the sketch after this list)
- Knowledge distillation from AV-HuBERT
- Only 8.5M parameters
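To make the "discrete lip semantic tokens" idea concrete, here is a minimal, hypothetical vector-quantization layer (nearest-codebook lookup with a straight-through gradient). Class and parameter names are illustrative and not taken from the repo.

```python
import torch
import torch.nn as nn

class VectorQuantizerSketch(nn.Module):
    """Toy nearest-neighbour codebook lookup turning continuous lip features
    into discrete token ids (illustrative only, not the repo's implementation)."""
    def __init__(self, num_codes=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                    # z: [batch, frames, dim]
        # Squared distance from every frame feature to every code vector
        dist = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        tokens = dist.argmin(dim=-1)                         # [batch, frames] discrete ids
        quantized = self.codebook(tokens)                    # [batch, frames, dim]
        # Straight-through estimator so gradients reach the upstream encoder
        quantized = z + (quantized - z).detach()
        return quantized, tokens
```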
#### Audio Encoder
- Convolutional encoder for time-frequency representation
- Extracts multi-scale acoustic features
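A rough sketch of such a learned convolutional front end; the kernel size, stride, and channel count below are assumptions rather than the released configuration.

```python
import torch
import torch.nn as nn

# Illustrative 1-D convolutional front end: waveform -> frame-level feature map.
# Kernel size, stride, and channel count are assumptions, not the released config.
audio_encoder = nn.Conv1d(in_channels=1, out_channels=256, kernel_size=16, stride=8, bias=False)

mixture = torch.randn(1, 1, 64000)                 # [batch, 1, samples], 4 s at 16 kHz
features = torch.relu(audio_encoder(mixture))      # [1, 256, T] latent frames
```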
#### Global-Local Attention Separator
- Single-pass TDANet-based architecture
- Global Attention (GA): Coarse-grained self-attention for long-range dependencies (illustrated in the sketch after this list)
- Local Attention (LA): Heat diffusion attention for noise suppression
- No iterative refinement needed
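The following toy sketch illustrates the coarse-grained global-attention idea: attend over a pooled sequence, then inject the context back at full resolution. It is not the paper's exact GA/LA design, and all sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalAttentionSketch(nn.Module):
    """Toy coarse-grained global attention: pool the sequence, self-attend,
    then upsample the context back (illustrative, not the paper's GA/LA blocks)."""
    def __init__(self, dim=256, num_heads=8, pool=8):
        super().__init__()
        self.pool = pool
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                               # x: [batch, channels, frames]
        coarse = F.avg_pool1d(x, self.pool)             # shorten the sequence before attention
        coarse = coarse.transpose(1, 2)                 # [batch, frames // pool, channels]
        ctx, _ = self.attn(coarse, coarse, coarse)      # long-range context at the coarse scale
        ctx = F.interpolate(ctx.transpose(1, 2), size=x.shape[-1], mode="nearest")
        return x + ctx                                  # fuse global context at full resolution

x = torch.randn(1, 256, 800)                            # [batch, channels, frames] feature map
y = GlobalAttentionSketch()(x)                          # same shape, with long-range context mixed in
```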
#### Audio Decoder
- Reconstructs separated waveform from enhanced features
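A matching illustrative decoder that maps enhanced latent frames back to a waveform with a transposed convolution (again, sizes are assumptions, not the released configuration):

```python
import torch
import torch.nn as nn

# Illustrative transposed-convolution decoder mirroring the encoder sketch above.
audio_decoder = nn.ConvTranspose1d(in_channels=256, out_channels=1, kernel_size=16, stride=8, bias=False)

enhanced = torch.randn(1, 256, 7999)               # enhanced latent frames from the separator
waveform = audio_decoder(enhanced)                 # [1, 1, 64000] reconstructed samples
```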
### Input/Output Specifications

Inputs:
- `audio`: mixed audio waveform, shape `[batch, samples]`, 16 kHz sampling rate
- `video`: grayscale lip region frames, shape `[batch, frames, 1, 88, 88]`, 25 fps

Output:
- `separated_audio`: separated target speech, shape `[batch, samples]`, 16 kHz
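Starting from an arbitrary recording, one way to match the expected audio format is shown below; the file name is a placeholder, and lip-region extraction for the video stream is handled end-to-end by the repo's `inference.py`.

```python
import torchaudio

# Load any recording and convert it to 16 kHz mono, matching the spec above.
# "mixture.wav" is a placeholder path, not a file shipped with the model.
waveform, sr = torchaudio.load("mixture.wav")             # [channels, samples]
waveform = waveform.mean(dim=0, keepdim=True)             # downmix to mono -> [1, samples]
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)
audio_mixture = waveform                                  # ready to pass alongside video frames
```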
## Training Details
- Dataset: VoxCeleb2 (2-speaker mixtures at 0dB SNR)
- Training: ~200K steps with Adam optimizer
- Augmentation: Random mixing, noise addition, video frame dropout
- Loss: SI-SNR (Scale-Invariant Signal-to-Noise Ratio)
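For reference, here is a compact PyTorch version of the SI-SNR objective, a sketch of the standard formulation rather than code extracted from the repo:

```python
import torch

def si_snr(estimate, target, eps=1e-8):
    """Scale-Invariant SNR in dB for [batch, samples] tensors
    (reference sketch of the standard formulation, not the repo's training code)."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)   # remove DC offset
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target to get the scaled reference signal
    dot = (estimate * target).sum(dim=-1, keepdim=True)
    s_target = dot * target / (target.pow(2).sum(dim=-1, keepdim=True) + eps)
    e_noise = estimate - s_target
    ratio = s_target.pow(2).sum(dim=-1) / (e_noise.pow(2).sum(dim=-1) + eps)
    return 10 * torch.log10(ratio + eps)

# Training minimises the negative SI-SNR, e.g.:
# loss = -si_snr(separated_audio, clean_target).mean()
```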
## Use Cases
- Hearing Aids: Camera-based speech enhancement
- Video Conferencing: Noise suppression with visual context
- In-Car Assistants: Driver speech extraction
- AR/VR: Immersive communication in noisy environments
- Edge Devices: Efficient deployment on mobile/embedded systems
## Limitations
- Requires frontal or near-frontal face view for optimal performance
- Works best with 25fps video input
- Trained on English speech (may need fine-tuning for other languages)
- Performance degrades with severe occlusions or low lighting
## Citation

```bibtex
@misc{li2025dolphin,
      title={Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention},
      author={Kai Li and Kejun Gao and Xiaolin Hu},
      year={2025},
      eprint={2509.23610},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2509.23610}
}
```
## License
Apache-2.0 License. See LICENSE for details.
## Acknowledgments
Built with inspiration from IIANet and SepReformer. Thanks to the Hugging Face team for hosting!
## Contact

- Email: tsinghua.kaili@gmail.com
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Developed by the Audio and Speech Group at Tsinghua University.