Dolphin: Efficient Audio-Visual Speech Separation

Dolphin Logo

Model Overview

Dolphin is an efficient audio-visual speech separation model that extracts target speech from noisy environments by combining acoustic and visual (lip-movement) cues. It reaches state-of-the-art separation quality while running more than 6× faster and using roughly 50% fewer parameters than the prior state-of-the-art model IIANet (see the benchmark table below).

🔗 Links: 📄 Paper | 💻 Code | 🎮 Demo | 🌐 Project Page

Key Features

  • 🎯 Balanced Quality & Efficiency: SOTA separation quality without iterative refinement
  • 🔬 DP-LipCoder: Lightweight video encoder with discrete audio-aligned semantic tokens
  • 🌍 Global-Local Attention: Multi-scale attention for long-range context and fine-grained details
  • 🚀 Edge-Friendly: >50% parameter reduction, >2.4× lower MACs, >6× faster inference

Performance

VoxCeleb2 Benchmark:

| Metric | Value |
|---|---|
| SI-SNRi | 16.1 dB |
| SDRi | 16.3 dB |
| PESQ | 3.45 |
| ESTOI | 0.93 |
| Parameters | 51.3M (vs. 112M for IIANet) |
| MACs | 417G (vs. 1009G for IIANet) |
| Inference speed | 0.015 s per 4 s clip (vs. 0.100 s for IIANet) |

Quick Start

Installation

pip install torch torchvision torchaudio
pip install huggingface_hub pyyaml  # pyyaml is needed to load conf.yml

Inference Example

import torch
from huggingface_hub import hf_hub_download
import yaml

# Download model and config
config_path = hf_hub_download(repo_id="JusperLee/Dolphin", filename="conf.yml")
model_path = hf_hub_download(repo_id="JusperLee/Dolphin", filename="best_model.pth")

# Load the config and build the model (the Dolphin class must be imported from the repository code)
with open(config_path) as f:
    config = yaml.safe_load(f)

model = Dolphin(**config['model'])
model.load_state_dict(torch.load(model_path, map_location='cpu'))
model.eval()

# Prepare inputs
# audio: [batch, samples] - 16kHz audio
# video: [batch, frames, 1, height, width] - grayscale lip frames
audio_mixture = torch.randn(1, 64000)  # 4 seconds at 16kHz
video_frames = torch.randn(1, 100, 1, 88, 88)  # 4s at 25fps, 88x88 resolution

# Separate speech
with torch.no_grad():
    separated_audio = model(audio_mixture, video_frames)
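
With real recordings, the mixture should be mono 16 kHz audio and the video a 25 fps grayscale lip crop (see the input/output specifications below). The sketch below loads a mixture and writes the result with torchaudio; the file names, and the assumption that the model returns a single [batch, samples] tensor, are illustrative rather than part of the released API.

import torchaudio

# Load a real mixture and convert it to the model's mono 16 kHz input
# ("mixture.wav" is a placeholder file name).
waveform, sr = torchaudio.load("mixture.wav")        # [channels, samples]
waveform = waveform.mean(dim=0, keepdim=True)        # downmix to mono: [1, samples]
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)

# Separate and write the estimated target speech to disk.
with torch.no_grad():
    estimate = model(waveform, video_frames)         # assumes a [1, samples] output
torchaudio.save("separated.wav", estimate.cpu(), sample_rate=16000)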

Complete Pipeline with Video Input

For end-to-end video processing with face detection and tracking, see our inference script:

git clone https://github.com/JusperLee/Dolphin.git
cd Dolphin
python inference.py \
    --input video.mp4 \
    --output ./output \
    --speakers 2 \
    --config checkpoints/vox2/conf.yml

Model Architecture

Components

  1. DP-LipCoder (Video Encoder)

    • Dual-path architecture: visual compression + semantic encoding
    • Vector quantization for discrete lip semantic tokens
    • Knowledge distillation from AV-HuBERT
    • Only 8.5M parameters
  2. Audio Encoder

    • Convolutional encoder for time-frequency representation
    • Extracts multi-scale acoustic features
  3. Global-Local Attention Separator

    • Single-pass TDANet-based architecture
    • Global Attention (GA): Coarse-grained self-attention for long-range dependencies
    • Local Attention (LA): Heat diffusion attention for noise suppression
    • No iterative refinement needed
  4. Audio Decoder

    • Reconstructs separated waveform from enhanced features
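
To make the data flow concrete, the skeleton below wires these four stages together at shape level. The module choices (a plain Conv3d lip encoder, a vanilla TransformerEncoder standing in for the global-local attention blocks, additive audio-visual fusion) and all layer sizes are assumptions for exposition only, not the released Dolphin implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyAVSeparator(nn.Module):
    """Shape-level stand-in for the encoder / separator / decoder pipeline described above."""

    def __init__(self, enc_dim=256, kernel=16, stride=8):
        super().__init__()
        # (2) Audio encoder: 1-D conv turns the waveform into a feature sequence.
        self.audio_enc = nn.Conv1d(1, enc_dim, kernel_size=kernel, stride=stride, bias=False)
        # (1) Video encoder stand-in: per-frame embeddings from 88x88 grayscale lip crops
        #     (the real DP-LipCoder adds vector quantization and AV-HuBERT distillation).
        self.video_enc = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(1, 5, 5), stride=(1, 4, 4), padding=(0, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),
        )
        self.video_proj = nn.Linear(32, enc_dim)
        # (3) Separator stand-in: estimates a mask over the audio features.
        self.separator = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=enc_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        # (4) Audio decoder: transposed conv maps masked features back to a waveform.
        self.audio_dec = nn.ConvTranspose1d(enc_dim, 1, kernel_size=kernel, stride=stride, bias=False)

    def forward(self, audio, video):
        # audio: [B, samples], video: [B, T, 1, 88, 88]
        feats = self.audio_enc(audio.unsqueeze(1))                  # [B, C, L]
        v = self.video_enc(video.permute(0, 2, 1, 3, 4))            # [B, 32, T, 1, 1]
        v = self.video_proj(v.flatten(2).transpose(1, 2))           # [B, T, C]
        # Upsample the 25 fps video embeddings to the audio feature rate and fuse by addition.
        v = F.interpolate(v.transpose(1, 2), size=feats.shape[-1]).transpose(1, 2)
        mask = torch.sigmoid(self.separator(feats.transpose(1, 2) + v))     # [B, L, C]
        return self.audio_dec(feats * mask.transpose(1, 2)).squeeze(1)      # [B, samples]

mix = torch.randn(1, 16000)            # 1 s at 16 kHz (a short clip keeps the toy attention cheap)
lips = torch.randn(1, 25, 1, 88, 88)   # 1 s of lip frames at 25 fps
print(TinyAVSeparator()(mix, lips).shape)   # torch.Size([1, 16000])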

Input/Output Specifications

Inputs:

  • audio: Mixed audio waveform, shape [batch, samples], 16kHz sampling rate
  • video: Grayscale lip region frames, shape [batch, frames, 1, 88, 88], 25fps

Output:

  • separated_audio: Separated target speech, shape [batch, samples], 16kHz
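
The model itself does no media decoding or lip-region detection (inference.py handles face detection and tracking). The sketch below shapes already-decoded tensors into the formats above using torchaudio and torchvision; prepare_audio and prepare_video are hypothetical helper names, not functions from the repository.

import torch
import torchaudio
import torchvision.transforms.functional as TF

def prepare_audio(waveform: torch.Tensor, sample_rate: int) -> torch.Tensor:
    """Downmix to mono and resample to 16 kHz; returns [1, samples]."""
    if waveform.dim() == 2:                       # [channels, samples] -> mono
        waveform = waveform.mean(dim=0, keepdim=True)
    if sample_rate != 16000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
    return waveform

def prepare_video(frames: torch.Tensor) -> torch.Tensor:
    """RGB lip crops [T, 3, H, W] -> grayscale 88x88 frames [1, T, 1, 88, 88].
    Assumes the lip region has already been detected and cropped."""
    frames = TF.rgb_to_grayscale(frames)          # [T, 1, H, W]
    frames = TF.resize(frames, [88, 88], antialias=True)
    return frames.unsqueeze(0)

# Example with dummy tensors in place of real decoded media:
audio = prepare_audio(torch.randn(2, 32000), sample_rate=8000)   # -> [1, 64000]
video = prepare_video(torch.rand(100, 3, 96, 96))                # -> [1, 100, 1, 88, 88]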

Training Details

  • Dataset: VoxCeleb2 (2-speaker mixtures at 0dB SNR)
  • Training: ~200K steps with Adam optimizer
  • Augmentation: Random mixing, noise addition, video frame dropout
  • Loss: SI-SNR (Scale-Invariant Signal-to-Noise Ratio)
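
SI-SNR is a standard time-domain separation objective; the snippet below is a generic reference implementation (negated so it can be minimized), not code taken from the Dolphin training recipe.

import torch

def si_snr_loss(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative scale-invariant SNR averaged over the batch; est/ref: [batch, samples]."""
    # Remove the mean so the measure is invariant to DC offset.
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference to obtain the target component.
    dot = torch.sum(est * ref, dim=-1, keepdim=True)
    target = dot * ref / (ref.pow(2).sum(dim=-1, keepdim=True) + eps)
    noise = est - target
    si_snr = 10 * torch.log10(target.pow(2).sum(dim=-1) / (noise.pow(2).sum(dim=-1) + eps) + eps)
    return -si_snr.mean()

# Example: loss between a model estimate and the clean reference.
loss = si_snr_loss(torch.randn(2, 64000), torch.randn(2, 64000))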

Use Cases

  • 🎧 Hearing Aids: Camera-based speech enhancement
  • 💼 Video Conferencing: Noise suppression with visual context
  • 🚗 In-Car Assistants: Driver speech extraction
  • 🥽 AR/VR: Immersive communication in noisy environments
  • 📱 Edge Devices: Efficient deployment on mobile/embedded systems

Limitations

  • Requires frontal or near-frontal face view for optimal performance
  • Works best with 25fps video input
  • Trained on English speech (may need fine-tuning for other languages)
  • Performance degrades with severe occlusions or low lighting

Citation

@misc{li2025dolphin,
  title={Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention}, 
  author={Kai Li and Kejun Gao and Xiaolin Hu},
  year={2025},
  eprint={2509.23610},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2509.23610}
}

License

Apache-2.0 License. See LICENSE for details.

Acknowledgments

Built with inspiration from IIANet and SepReformer. Thanks to the Hugging Face team for hosting!

Contact

Developed by the Audio and Speech Group at Tsinghua University 🎓
