SPEAR Large (speech + general audio)

This is the SPEAR Large dual-domain (speech + general audio) model. It adopts a Zipformer backbone with 327M parameters organised into 112 Zipformer stacks, and produces 1024-dimensional representations at approximately 50 Hz.

The model was pre-trained on 97k hours of mixed English speech and general audio data: 84k hours of speech and 13k hours of general audio. It achieves strong performance on both the SUPERB benchmark (speech representation evaluation) and the HEAR benchmark (general audio representation evaluation).

The speech data consists of the following datasets:

Dataset Duration (hours)
Libriheavy ~50k
Gigaspeech ~10k
VoxPopuli (en) ~24k

The audio data consists of the following datasets:

Dataset Duration (hours)
AudioSet ~5k
Freesound ~2.8k
Music4all ~1k
VGGSound ~0.5k
MTG-Jamendo ~3.8k

Note: The model was pre-trained on speech/audio sampled at 16 kHz. When using the model, make sure that your input is also sampled at 16 kHz.
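If your recordings are not at 16 kHz, a minimal resampling sketch using torchaudio (an assumed dependency; the file path below is a placeholder) could look like this:

import torchaudio

# Load a waveform and resample it to the 16 kHz rate the model expects.
# "example.wav" is a placeholder path.
waveform, sample_rate = torchaudio.load("example.wav")  # (channels, samples)
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=16000)
if waveform.shape[0] > 1:
    # collapse multi-channel audio to mono
    waveform = waveform.mean(dim=0, keepdim=True)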

Paper

Authors: Xiaoyu Yang, Yifan Yang, Zengrui Jin, Ziyun Cui, Wen Wu, Baoxiang Li, Chao Zhang, Phil Woodland

Abstract: Self-Supervised Learning (SSL) excels at learning generic representations of acoustic signals, yet prevailing methods remain domain-specific, tailored to either speech or general audio, hindering the development of a unified representation model with a comprehensive capability over both domains. To address this, we present SPEAR (SPEech and Audio Representations), the first SSL framework to successfully learn unified speech and audio representations from a mixture of speech and audio data. SPEAR proposes a unified pre-training objective based on masked prediction of fine-grained discrete tokens for both speech and general audio. These tokens are derived from continuous speech and audio representations using a Multi-codebook Vector Quantisation (MVQ) method, retaining rich acoustic detail essential for modelling both speech and complex audio events. SPEAR is applied to pre-train both single-domain and unified speech-and-audio SSL models. Our speech-domain model establishes a new state-of-the-art on the SUPERB benchmark, a speech processing benchmark for SSL models, matching or surpassing the highly competitive WavLM Large on 12 out of 15 tasks with the same pre-training corpora and a similar model size. Crucially, our unified model learns complementary features and demonstrates comprehensive capabilities across two major benchmarks, SUPERB and HEAR, for evaluating audio representations. By further scaling up the model size and pre-training data, we present a unified model with 600M parameters that excels in both domains, establishing it as one of the most powerful and versatile open-source SSL models for auditory understanding.

Usage

This model was pre-trained purely on unlabelled data. It therefore requires fine-tuning with labelled data for downstream tasks such as automatic speech recognition (ASR) or audio tagging (AT).
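As an illustration, a minimal, hypothetical audio-tagging head could mean-pool the encoder output over time and apply a linear classifier. The forward interface assumed here matches the feature-extraction example further below; this sketch is not the recipe used for the results reported in this card.

import torch
import torch.nn as nn

class AudioTaggingHead(nn.Module):
    """Illustrative fine-tuning head: mean-pool SPEAR encoder output and classify."""
    def __init__(self, encoder, num_classes=527, hidden_dim=1024):
        super().__init__()
        self.encoder = encoder                      # the pre-trained SPEAR model
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, audio, audio_len):
        outputs = self.encoder(audio, audio_len)    # dict, as in the example below
        encoder_out = outputs["encoder_out"]        # (N, T, C)
        pooled = encoder_out.mean(dim=1)            # (N, C), mean over time
        return self.classifier(pooled)              # (N, num_classes) logits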

The model achieves the following word error rates (WERs) when fine-tuned on LibriSpeech for ASR:

Fine-tuning data test-clean test-other
LS960 1.7 3.4

The model achieves the following mean average precision (mAP) when fine-tuned on AudioSet for AT:

Fine-tuning data mAP
AudioSet Balanced 39.2
AudioSet Full 49.6

You can extract its top-layer features (and intermediate hidden states) using the following code:

from transformers import AutoModel
import torch

model = AutoModel.from_pretrained(
    "marcoyang/spear-large-speech-audio", 
    trust_remote_code=True,
    force_download=False,
)
if torch.cuda.is_available():
    model = model.to("cuda")
model.eval()

device = next(model.parameters()).device
audio = torch.randn(1, 160000).to(device) # dummy audio input of 10 seconds
audio_len = torch.tensor([160000]).to(device)

with torch.no_grad():
    outputs = model(audio, audio_len)

encoder_out = outputs["encoder_out"] # (N,T,C)
encoder_out_lens = outputs["encoder_out_lens"] # (N)
middle_out = outputs["hidden_states"] # list of (N,T,C)

print(encoder_out)
print(encoder_out_lens)
print(len(middle_out)) # 11 layers
print(middle_out[-1].shape)
print(middle_out[-1])
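
Downstream probes (e.g. in the SUPERB style) often combine the intermediate hidden states with a learnable weighted sum over layers. The following sketch reuses the outputs from the example above; it is illustrative and not part of the released model code.

# stack the hidden states from the example above: (L, N, T, C)
stacked = torch.stack(middle_out, dim=0)
# learnable per-layer weights (uniform after softmax at initialisation)
layer_weights = torch.nn.Parameter(torch.zeros(stacked.shape[0], device=stacked.device))
weights = torch.softmax(layer_weights, dim=0)
weighted_sum = (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # (N, T, C)
print(weighted_sum.shape)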