VibeVoice-Hindi-1.5B
Model Description
VibeVoice-Hindi-1.5B is a frontier text-to-speech model specifically fine-tuned for Hindi language synthesis. This model is built upon the VibeVoice-1.5B architecture and has been adapted to generate high-quality, natural, and expressive Hindi speech from text input.
VibeVoice represents a breakthrough in TTS technology, capable of generating long-form, multi-speaker conversational audio such as podcasts and dialogues. This Hindi variant extends these capabilities to one of the world's most widely spoken languages.
Base Model
- Base Architecture: vibevoice/VibeVoice-1.5B
- LLM Backbone: Qwen2.5-1.5B
- Tokenizers: Acoustic (σ-VAE) + Semantic tokenizers @ 7.5 Hz
- Diffusion Head: ~600M parameters for high-fidelity acoustic generation
Fine-tuning Details
- Target Language: Hindi
- Method: LoRA adapters on LLM + full fine-tuning of diffusion head
- Training Strategy: Curriculum learning with increasing sequence lengths
Usage
Demo and Inference Code
For complete inference examples and demos, please refer to:
- Community Repository: vibevoice-community/VibeVoice
- ComfyUI Integration: Enemyx-net/VibeVoice-ComfyUI
Hindi Inference
Using with VibeVoice Inference Pipeline
# Clone the community repository
git clone https://github.com/vibevoice-community/VibeVoice.git
cd VibeVoice
# Install dependencies
uv pip install -e .
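Before launching the demos, an optional sanity check (a minimal snippet; assumes the install step above pulled in PyTorch) to confirm a CUDA device is visible:

# Confirm PyTorch and CUDA availability before running inference.
import torch

print(f"PyTorch {torch.__version__}; CUDA available: {torch.cuda.is_available()}")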
With voice cloning (Recommended for Hindi models):
python demo/inference_from_file.py \
--model_path "tarun7r/vibevoice-hindi-1.5B" \
--txt_path "./example_hindi_script.txt" \
--speaker_names hi-Priya_woman \
--seed 42
With multiple speakers:
python demo/inference_from_file.py \
--model_path "tarun7r/vibevoice-hindi-1.5B" \
--txt_path "./example_hindi_script.txt" \
--speaker_names "Speaker1" "Speaker2" \
--cfg_scale 1.3
Note: For voice cloning, ensure you have corresponding voice files in demo/voices/ directory. The script will automatically map speaker names to voice files.
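For reference, the community demo scripts read plain-text transcripts with one "Speaker N:"-prefixed line per utterance; the exact format is defined by the repository's demo code, but an example_hindi_script.txt along these lines should work:

Speaker 1: नमस्ते और हमारे पॉडकास्ट में आपका स्वागत है।
Speaker 2: धन्यवाद! आज का विषय बहुत ही रोचक है।
Speaker 1: तो चलिए शुरू करते हैं।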
Using the dedicated Hindi inference script (Recommended):
python inference_hindi_vibevoice.py \
--model_path "tarun7r/vibevoice-hindi-1.5B" \
--txt_path "./example_hindi_script.txt"
Key points for Hindi inference:
- With voice cloning: specify --speaker_names to map speakers to voice files
- Use --model_path "tarun7r/vibevoice-hindi-1.5B" to match your model
- The provided voice samples are loaded and used for voice cloning during generation
Voice Cloning Setup:
- Place voice sample files in the demo/voices/ directory
- Required file: hi-Priya_woman.wav (Hindi female voice sample)
- Use descriptive filenames like hindi-speaker1.wav, hindi-speaker2.wav
- The script will automatically map speaker names to voice files
- Voice cloning works best with high-quality, clear voice samples
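As a rough illustration of the automatic name-to-file mapping described above (a minimal sketch, not the repository's actual implementation), the lookup can be thought of as:

# Sketch of speaker-name -> voice-file resolution, assuming samples
# live in demo/voices/ as <speaker_name>.wav (e.g. hi-Priya_woman.wav).
from pathlib import Path

VOICES_DIR = Path("demo/voices")

def resolve_voice(speaker_name: str) -> Path:
    candidate = VOICES_DIR / f"{speaker_name}.wav"
    if not candidate.exists():
        raise FileNotFoundError(f"No voice sample for '{speaker_name}' at {candidate}")
    return candidate

print(resolve_voice("hi-Priya_woman"))  # -> demo/voices/hi-Priya_woman.wav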
Model Architecture Compatibility:
- Ensure the checkpoint matches the model size (the 1.5B model requires --model_path "tarun7r/vibevoice-hindi-1.5B")
Hindi Inference with Gradio Demo
For interactive Hindi speech generation with the model:
Launch the Gradio Demo:
python demo/gradio_demo.py \
--model_path "tarun7r/vibevoice-hindi-1.5B" \
--device cuda
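Once the demo is running, it can also be driven programmatically with the gradio_client package (a sketch; the endpoint names and parameters are not documented here, so inspect them first with view_api):

# Connect to the locally running Gradio demo and list its callable endpoints.
from gradio_client import Client

client = Client("http://127.0.0.1:7860")  # default local Gradio address (assumed)
client.view_api()  # prints available endpoints and their expected parameters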
Using the Web Interface:
- Enter your Hindi script in the text area
- Select speakers (use hi-Priya_woman for the Hindi voice)
- Click "🚀 Generate Podcast"
Key points:
- The provided voice samples are loaded and used for voice cloning during generation
- Real-time streaming audio generation is supported
- Works with both the 1.5B and 7B models (ensure --model_path matches the checkpoint size)
- Make sure hi-Priya_woman.wav is in the demo/voices/ directory
Demo
Sample Output: (embedded audio sample; see the model page)
Important Note: The quality of the generated audio depends heavily on the reference voice file you provide in the demo/voices/ directory. For best results:
- Use high-quality, clear voice samples
- Ensure the reference voice matches the desired speaking style
- Longer reference samples (10-30 seconds) generally produce better results
- The voice characteristics of the reference sample will be transferred to the generated speech
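A small helper to verify a reference sample against these guidelines before generating (a sketch; assumes the soundfile package, installable via pip install soundfile):

# Report the duration and sample rate of a reference voice sample.
import soundfile as sf

def check_reference(path: str, min_s: float = 10.0, max_s: float = 30.0) -> None:
    info = sf.info(path)
    duration = info.frames / info.samplerate
    print(f"{path}: {duration:.1f}s at {info.samplerate} Hz, {info.channels} channel(s)")
    if not (min_s <= duration <= max_s):
        print("Warning: 10-30 second samples generally clone best.")

check_reference("demo/voices/hi-Priya_woman.wav")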
Model Capabilities
- Text-to-Speech: Convert Hindi text to natural-sounding speech
- Multi-speaker Support: Generate speech with multiple distinct speakers
- Long-form Audio: Synthesize extended audio sequences (up to 90 minutes)
- Expressive Speech: Maintain natural prosody and intonation for Hindi
Training Details
This model was fine-tuned using:
- Technique: LoRA with rank decomposition
- Components Trained:
- LoRA adapters on the LLM backbone
- Full fine-tuning of diffusion head
- Connector modules for acoustic and semantic features
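The exact hyperparameters were not published; as a rough sketch of what the LoRA portion might look like with the peft library (the rank, alpha, and target modules below are assumptions, not the values used for this model):

# Illustrative LoRA configuration for a Qwen2.5-style LLM backbone.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                 # low-rank decomposition rank (assumed)
    lora_alpha=32,        # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections (assumed)
    task_type="CAUSAL_LM",
)
# get_peft_model(llm, lora_config) would wrap the backbone; the diffusion head
# and connector modules were fully fine-tuned rather than adapted with LoRA.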
Responsible Usage
Direct intended uses
The VibeVoice model is limited to research use exploring highly realistic audio dialogue generation, as detailed in the tech report.
Out-of-scope uses
Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way prohibited by the MIT License. Use to generate any text transcript. Furthermore, this release is not intended or licensed for any of the following scenarios:
- Voice impersonation without explicit, recorded consent – cloning a real individual's voice for satire, advertising, ransom, social‑engineering, or authentication bypass.
- Disinformation or impersonation – creating audio presented as genuine recordings of real people or events.
- Real‑time or low‑latency voice conversion – telephone or video‑conference "live deep‑fake" applications.
- Unsupported languages – the base model is trained only on English and Chinese data, and this fine-tune adds Hindi; outputs in other languages are unsupported and may be unintelligible or offensive.
- Generation of background ambience, Foley, or music – VibeVoice is speech‑only and will not produce coherent non‑speech audio.
Risks and limitations
While efforts have been made to optimize it through various techniques, the model may still produce outputs that are unexpected, biased, or inaccurate. VibeVoice inherits any biases, errors, or omissions of its base model.
- Potential for deepfakes and disinformation: high-quality synthetic speech can be misused to create convincing fake audio for impersonation, fraud, or spreading disinformation. Users must ensure transcripts are reliable, check content accuracy, and avoid using generated content in misleading ways. Users are expected to deploy the model lawfully, in full compliance with all applicable laws and regulations in the relevant jurisdictions. It is best practice to disclose the use of AI when sharing AI-generated content.
- Language coverage: the base model supports English and Chinese, and this fine-tune adds Hindi; transcripts in other languages may result in unexpected audio outputs.
- Non-speech audio: the model focuses solely on speech synthesis and does not handle background noise, music, or other sound effects.
- Overlapping speech: the current model does not explicitly model or generate overlapping speech segments in conversations.
Recommendations
We do not recommend using VibeVoice in commercial or real-world applications without further testing and development. This model is intended for research and development purposes only. Please use responsibly.
To mitigate the risks of misuse, we have:
- Embedded an audible disclaimer (e.g. "This segment was generated by AI") automatically into every synthesized audio file.
- Added an imperceptible watermark to generated audio so third parties can verify VibeVoice provenance (see contact information at the end of this model card).
- Logged inference requests (hashed) for abuse pattern detection, publishing aggregated statistics quarterly.
Users are responsible for sourcing their datasets legally and ethically. This may include securing appropriate rights and/or anonymizing data prior to use with VibeVoice. Users are reminded to be mindful of data privacy concerns.
License & Redistribution Notice
This model is released under the MIT License, consistent with the base VibeVoice model.
Redistribution Notice: This repository contains model weights derived from microsoft/VibeVoice-1.5B, which is licensed under the MIT License. The MIT License permits redistribution and derivative works.
My understanding of the MIT License, which is consistent with the broader open-source community's consensus, is that it grants the right to distribute copies of the software and its derivatives. Therefore, I am lawfully exercising the right to redistribute this model.
If you are a rights holder and believe this understanding of the license is incorrect, please submit a DMCA complaint to Hugging Face at dmca@huggingface.co
Acknowledgments
- Base Model: Microsoft Research for the original VibeVoice model
- Fine-tuning Code: vibevoice-community/VibeVoice for the training framework
- Training Infrastructure: Nebius H100 GPU cluster
- Community: Hugging Face and the open-source AI community
- Framework: Built on Qwen2.5, Transformers, and PEFT libraries
Contact
Actively seeking opportunities as an ML Engineer II / Data Scientist II
For questions, issues, or collaboration:
- GitHub: tarun7r
- LinkedIn: Tarun Sai Goddu
- Hugging Face: tarun7r
- Base model contact: VibeVoice@microsoft.com
Key Projects
- SpeechAlgo - Comprehensive Speech Processing Algorithms Library
- Vocal-Agent - Cascading voice assistant with real-time speech recognition
- Finance-Llama-8B - Financial domain fine-tuned Llama model
Note: This is a research model. Please use responsibly and in compliance with applicable laws and ethical guidelines.