This model was released on 2025-08-26 and added to Hugging Face Transformers on 2026-02-05.
VibeVoice Acoustic Tokenizer
Overview
VibeVoice is a novel framework for synthesizing high-fidelity, long-form speech with multiple speakers by employing a next-token diffusion approach within a Large Language Model (LLM) structure. It’s designed to capture the authentic conversational “vibe” and is particularly suited for generating audio content like podcasts and multi-participant audiobooks.
One key feature of VibeVoice is the use of two continuous audio tokenizers, one for extracting acoustic features and another for semantic features.
A model checkpoint is available at microsoft/VibeVoice-AcousticTokenizer.
This model was contributed by Eric Bezzam.
Architecture
The architecture is a mirror-symmetric encoder-decoder structure. The encoder employs a hierarchical design with 7 stages of ConvNeXt-like blocks, which use 1D depth-wise causal convolutions for efficient streaming processing. Six downsampling layers achieve a cumulative 3200x downsampling rate from a 24 kHz input, yielding 7.5 tokens (frames) per second. Each encoder/decoder component has approximately 340M parameters, for a total of around 680M parameters. The training objective follows that of DAC, including its discriminator and loss designs.
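As a quick sanity check (a minimal sketch, not part of the library), the hop length and token rate follow directly from the default downsampling_ratios:
import math
# Default downsampling ratios from VibeVoiceAcousticTokenizerConfig
downsampling_ratios = [2, 2, 4, 5, 5, 8]
hop_length = math.prod(downsampling_ratios)   # 3200 samples per latent frame
sampling_rate = 24_000                        # Hz
tokens_per_second = sampling_rate / hop_length
print(hop_length, tokens_per_second)          # 3200 7.5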
The Acoustic Tokenizer adopts the principles of a Variational Autoencoder (VAE). The encoder maps the input audio to the parameters of a latent distribution, namely the mean. Along with a fixed standard deviation, a latent vector is then sampled using the reparameterization trick. Please refer to the technical report for further details.
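As a rough illustration of the sampling step (a sketch, not the model's internal code), sampling with a fixed standard deviation amounts to adding scaled Gaussian noise to the encoder's mean:
import torch
# Illustrative reparameterization with a fixed standard deviation (config.vae_std = 0.625).
# `mean` stands in for the encoder output; the shape is arbitrary for this sketch.
mean = torch.randn(1, 70, 64)
vae_std = 0.625
latents = mean + vae_std * torch.randn_like(mean)   # sample=True
# With sample=False, the latents are simply the mean (no noise is added).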
Usage
Below is an example of how to encode and decode audio:
import torch
from scipy.io import wavfile
from transformers import AutoFeatureExtractor, VibeVoiceAcousticTokenizerModel
from transformers.audio_utils import load_audio_librosa
model_id = "microsoft/VibeVoice-AcousticTokenizer"
# load model
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = VibeVoiceAcousticTokenizerModel.from_pretrained(model_id, device_map="auto")
print("Model loaded on device:", model.device)
print("Model dtype:", model.dtype)
# load audio
audio = load_audio_librosa(
    "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/voices/en-Alice_woman.wav",
    sampling_rate=feature_extractor.sampling_rate,
)
# preprocess audio
inputs = feature_extractor(
    audio,
    sampling_rate=feature_extractor.sampling_rate,
    pad_to_multiple_of=3200,
).to(model.device, model.dtype)
print("Input audio shape:", inputs.input_values.shape)
# Input audio shape: torch.Size([1, 1, 224000])
with torch.no_grad():
    # set VAE sampling to False for deterministic output
    encoded_outputs = model.encode(inputs.input_values, sample=False)
    print("Latent shape:", encoded_outputs.latents.shape)
    # Latent shape: torch.Size([1, 70, 64])
    decoded_outputs = model.decode(**encoded_outputs)
    print("Reconstructed audio shape:", decoded_outputs.audio.shape)
    # Reconstructed audio shape: torch.Size([1, 1, 224000])
# Save audio
output_fp = "vibevoice_acoustic_tokenizer_reconstructed.wav"
wavfile.write(output_fp, feature_extractor.sampling_rate, decoded_outputs.audio.squeeze().float().cpu().numpy())
print(f"Reconstructed audio saved to : {output_fp}")Streaming
Streaming
For streaming ASR or TTS, where cached states need to be tracked, the use_cache parameter can be used when encoding or decoding audio:
import torch
from scipy.io import wavfile
from transformers import AutoFeatureExtractor, VibeVoiceAcousticTokenizerModel
from transformers.audio_utils import load_audio_librosa
model_id = "microsoft/VibeVoice-AcousticTokenizer"
# load model
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = VibeVoiceAcousticTokenizerModel.from_pretrained(model_id, device_map="auto")
print("Model loaded on device:", model.device)
print("Model dtype:", model.dtype)
# load audio
audio = load_audio_librosa(
    "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/voices/en-Alice_woman.wav",
    sampling_rate=feature_extractor.sampling_rate,
)
# preprocess audio
inputs = feature_extractor(
    audio,
    sampling_rate=feature_extractor.sampling_rate,
    pad_to_multiple_of=3200,
).to(model.device, model.dtype)
print("Input audio shape:", inputs.input_values.shape)
# Input audio shape: torch.Size([1, 1, 224000])
# caches will be initialized after a first pass
encoder_cache = None
decoder_cache = None
with torch.no_grad():
    # set VAE sampling to False for deterministic output
    encoded_outputs = model.encode(inputs.input_values, sample=False, padding_cache=encoder_cache, use_cache=True)
    print("Latent shape:", encoded_outputs.latents.shape)
    # Latent shape: torch.Size([1, 70, 64])
    decoded_outputs = model.decode(encoded_outputs.latents, padding_cache=decoder_cache, use_cache=True)
    print("Reconstructed audio shape:", decoded_outputs.audio.shape)
    # Reconstructed audio shape: torch.Size([1, 1, 224000])

# `padding_cache` can be extracted from the outputs for subsequent passes
encoder_cache = encoded_outputs.padding_cache
print("Number of cached encoder layers:", len(encoder_cache.per_layer_in_channels))
# Number of cached encoder layers: 34
decoder_cache = decoded_outputs.padding_cache
print("Number of cached decoder layers:", len(decoder_cache.per_layer_in_channels))
# Number of cached decoder layers: 34
# Save audio
output_fp = "vibevoice_acoustic_tokenizer_reconstructed.wav"
wavfile.write(output_fp, feature_extractor.sampling_rate, decoded_outputs.audio.squeeze().float().cpu().numpy())
print(f"Reconstructed audio saved to : {output_fp}")VibeVoiceAcousticTokenizerConfig
VibeVoiceAcousticTokenizerConfig
class transformers.VibeVoiceAcousticTokenizerConfig
< source >( channels = 1 hidden_size = 64 kernel_size = 7 rms_norm_eps = 1e-05 layer_scale_init_value = 1e-06 initializer_range = 0.01 num_filters = 32 downsampling_ratios = [2, 2, 4, 5, 5, 8] depths = [3, 3, 3, 3, 3, 3, 8] hidden_act = 'gelu' ffn_expansion = 4 vae_std = 0.625 **kwargs )
Parameters
- channels (int, optional, defaults to 1) — Number of input channels.
- hidden_size (int, optional, defaults to 64) — Dimensionality of latent representations.
- kernel_size (int, optional, defaults to 7) — Kernel size for convolutional layers.
- rms_norm_eps (float, optional, defaults to 1e-05) — Epsilon value for RMSNorm layers.
- layer_scale_init_value (float, optional, defaults to 1e-06) — Initial value for layer scaling.
- initializer_range (float, optional, defaults to 0.01) — Standard deviation for weight initialization.
- num_filters (int, optional, defaults to 32) — Number of filters in the initial convolutional layer; doubles after each downsampling.
- downsampling_ratios (List[int], optional, defaults to [2, 2, 4, 5, 5, 8]) — Downsampling ratios for each layer.
- depths (List[int], optional, defaults to [3, 3, 3, 3, 3, 3, 8]) — Number of ConvNeXt blocks at each stage.
- hidden_act (str, optional, defaults to "gelu") — Activation function to use.
- ffn_expansion (int, optional, defaults to 4) — Expansion factor for feed-forward networks.
- vae_std (float, optional, defaults to 0.625) — Standard deviation used for VAE sampling after the encoder.
This is the configuration class to store the configuration of a VibeVoiceAcousticTokenizerModel. It is used to instantiate a VibeVoice acoustic tokenizer model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the acoustic tokenizer within the VibeVoice architecture.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
Example:
>>> from transformers import VibeVoiceAcousticTokenizerModel, VibeVoiceAcousticTokenizerConfig
>>> # Initializing a VibeVoice Acoustic Tokenizer configuration
>>> configuration = VibeVoiceAcousticTokenizerConfig()
>>> # Initializing a model (with random weights)
>>> model = VibeVoiceAcousticTokenizerModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
VibeVoiceAcousticTokenizerFeatureExtractor
class transformers.VibeVoiceAcousticTokenizerFeatureExtractor
< source >( feature_size = 1 sampling_rate = 24000 padding_value = 0.0 normalize_audio = True target_dB_FS = -25 eps = 1e-06 **kwargs )
Parameters
- feature_size (int, optional, defaults to 1) — The number of channels.
- sampling_rate (int, optional, defaults to 24000) — The sampling rate at which the audio waveform should be digitalized, expressed in hertz (Hz).
- padding_value (float, optional, defaults to 0.0) — The value that is used for padding.
- normalize_audio (bool, optional, defaults to True) — Whether to normalize audio to a target dB FS.
- target_dB_FS (float, optional, defaults to -25) — Target dB FS for normalization.
- eps (float, optional, defaults to 1e-06) — A small value to avoid division by zero when normalizing.
Constructs a VibeVoiceAcousticTokenizer feature extractor.
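When normalize_audio is enabled, the waveform is rescaled toward target_dB_FS. The sketch below illustrates what RMS-based dB FS normalization typically computes; it is an assumption for intuition, not necessarily the exact implementation used by the feature extractor:
import numpy as np

def normalize_to_dbfs(audio: np.ndarray, target_db_fs: float = -25.0, eps: float = 1e-6) -> np.ndarray:
    """Illustrative RMS-based loudness normalization toward a target dB FS (sketch only)."""
    rms = np.sqrt(np.mean(audio ** 2))
    target_rms = 10 ** (target_db_fs / 20)
    return audio * (target_rms / (rms + eps))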
__call__
< source >( audio: typing.Union[numpy.ndarray, ForwardRef('torch.Tensor'), collections.abc.Sequence[numpy.ndarray], collections.abc.Sequence['torch.Tensor']] sampling_rate: int | None = None padding: bool | str | transformers.utils.generic.PaddingStrategy | None = True pad_to_multiple_of: int | None = None return_attention_mask: bool | None = True )
Parameters
- audio (np.ndarray, torch.Tensor, list[np.ndarray], list[torch.Tensor]) — The sequence or batch of sequences to be processed. Each sequence can be a numpy array, a torch tensor, a list of numpy arrays or a list of torch tensors.
- sampling_rate (int, optional) — The sampling rate at which the audio input was sampled. It is strongly recommended to pass sampling_rate at the forward call to prevent silent errors.
- padding (bool, str or PaddingStrategy, optional, defaults to True) — Select a strategy to pad the returned sequences (according to the model’s padding side and padding index) among:
  - True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
  - 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
  - False or 'do_not_pad': No padding (i.e., the output batch can have sequences of different lengths).
- pad_to_multiple_of (int, optional) — If set, will pad the sequence to a multiple of the provided value.
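For example, a batch of clips with different lengths can be padded together. The snippet below is a minimal sketch using synthetic audio; only the input_values output shown in the Usage section is assumed:
import numpy as np
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/VibeVoice-AcousticTokenizer")
# Two synthetic clips of different lengths at 24 kHz
clips = [np.random.randn(24000).astype(np.float32), np.random.randn(36000).astype(np.float32)]
inputs = feature_extractor(clips, sampling_rate=feature_extractor.sampling_rate, pad_to_multiple_of=3200)
print(inputs.input_values.shape)  # batch padded to a common length that is a multiple of 3200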
VibeVoiceAcousticTokenizerModel
class transformers.VibeVoiceAcousticTokenizerModel
< source >( config )
Parameters
- config (VibeVoiceAcousticTokenizerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
VibeVoice acoustic tokenizer with an encoder and decoder for continuous acoustic tokens.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
encode
< source >( input_values padding_cache = None use_cache = None sample = True )
Parameters
- input_values (torch.FloatTensor of shape (batch_size, channels, sequence_length)) — Input audio waveform to be encoded into a latent representation.
- padding_cache (VibeVoiceAcousticTokenizerConv1dPaddingCache, optional) — Cache object for streaming mode to maintain convolution states across layers.
- use_cache (bool, optional) — Whether to use caching for convolution states.
- sample (bool, optional) — Whether to sample from the VAE. If False, no noise is added.
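For instance, encoding the same waveform twice with sample=True yields different latents, while sample=False is deterministic. The sketch below reuses model and inputs from the Usage section:
with torch.no_grad():
    deterministic = model.encode(inputs.input_values, sample=False).latents
    stochastic = model.encode(inputs.input_values, sample=True).latents
# sample=True adds Gaussian noise (scaled by config.vae_std), so the two generally differ
print(torch.allclose(deterministic, stochastic))  # expected: False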
decode
< source >( latents padding_cache = None use_cache = False )
Parameters
- latents (torch.FloatTensor of shape (batch_size, channels, sequence_length)) — Input latent representation to be decoded back into audio.
- padding_cache (VibeVoiceAcousticTokenizerConv1dPaddingCache, optional) — Cache object for streaming mode to maintain convolution states across layers.
- use_cache (bool, optional) — Whether to use caching for convolution states.
forward
< source >( input_values padding_cache = None use_cache = False sample = True **kwargs )
Parameters
- input_values (torch.FloatTensor of shape (batch_size, channels, sequence_length)) — Input audio waveform to be encoded into a latent representation.
- padding_cache (VibeVoiceAcousticTokenizerConv1dPaddingCache, optional) — Cache object for streaming mode to maintain convolution states across layers. Note: only used by the decoder.
- use_cache (bool, optional) — Whether to use caching for convolution states.
- sample (bool, optional) — Whether to sample from the VAE latent distribution. If False, no noise is added to the latents.
The VibeVoiceAcousticTokenizerModel forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.