Upload README.md with huggingface_hub

README.md (CHANGED)
@@ -1,104 +1,14 @@

<div align="center">
<a href="https://huggingface.co/nari-labs/Dia2-2B"><img src="https://img.shields.io/badge/HF%20Repo-Dia2--2B-orange?style=for-the-badge"></a>
<a href="https://discord.gg/bJq6vjRRKv"><img src="https://img.shields.io/badge/Discord-Join%20Chat-7289DA?logo=discord&style=for-the-badge"></a>
<a href="https://github.com/nari-labs/dia2/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Apache_2.0-blue.svg?style=for-the-badge"></a>
</div>

**Dia2** is a **streaming dialogue TTS model** created by Nari Labs.

The model does not need the entire text to produce audio; it can start generating as soon as the first few words are given as input. You can condition the output on audio, enabling natural conversations in real time.

We provide model checkpoints (1B, 2B) and inference code to accelerate research. The model supports up to 2 minutes of generation, in English only.

⚠️ Quality and voices vary per generation, as the model is not fine-tuned on a specific voice. Use an audio prefix or fine-tune the model to obtain stable output.

## Upcoming

- Bonsai (JAX) implementation
- Dia2 TTS Server: real streaming support
- Sori: a Dia2-powered speech-to-speech engine written in Rust

## Quickstart

> **Requirement**: install [uv](https://docs.astral.sh/uv/) and use CUDA 12.8+ drivers. All commands below run through `uv run …`.

1. **Install dependencies (one-time):**

   ```bash
   uv sync
   ```

2. **Prepare a script:** edit `input.txt` using `[S1]` / `[S2]` speaker tags.

3. **Generate audio:**

   ```bash
   uv run -m dia2.cli \
     --hf nari-labs/Dia2-2B \
     --input input.txt \
     --cfg 6.0 --temperature 0.8 \
     --cuda-graph --verbose \
     output.wav
   ```

   The first run downloads the weights, tokenizer, and Mimi codec. The CLI auto-selects CUDA when available (otherwise CPU) and defaults to bfloat16 precision; override with `--device` / `--dtype` if needed.
4. **Conditional generation (recommended for stable use):**

   ```bash
   uv run -m dia2.cli \
     --hf nari-labs/Dia2-2B \
     --input input.txt \
     --prefix-speaker-1 example_prefix1.wav \
     --prefix-speaker-2 example_prefix2.wav \
     --cuda-graph --verbose \
     output_conditioned.wav
   ```

   Condition the generation on previous conversational context to produce natural output for your speech-to-speech system. For example, use your assistant's voice as prefix speaker 1 and the user's audio input as prefix speaker 2, then generate the response to the user's input.

   Whisper is used to transcribe each prefix file, which takes additional time. We include example prefix files as `example_prefix1.wav` and `example_prefix2.wav` (both were generated by the model).

5. **Gradio for easy usage:**

   ```bash
   uv run gradio_app.py
   ```

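As a concrete starting point, a minimal `input.txt` can be written from the shell. The `[S1]` / `[S2]` speaker tags are the format described in step 2; the dialogue lines themselves are made-up examples, not content from the Dia2 repository:

```shell
# Write a short two-speaker script; [S1]/[S2] mark speaker turns.
# The dialogue content here is illustrative only.
cat > input.txt <<'EOF'
[S1] Hey, have you tried the new streaming TTS model?
[S2] Not yet. Does it really start speaking before the text is finished?
[S1] It does, and you can condition it on earlier audio too.
EOF
```

Alternating the tags line by line, as above, gives the model clear turn boundaries for a two-speaker dialogue.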
### Programmatic Usage

```python
from dia2 import Dia2, GenerationConfig, SamplingConfig

dia = Dia2.from_repo("nari-labs/Dia2-2B", device="cuda", dtype="bfloat16")
config = GenerationConfig(
    cfg_scale=2.0,
    audio=SamplingConfig(temperature=0.8, top_k=50),
    use_cuda_graph=True,
)
result = dia.generate("[S1] Hello Dia2!", config=config, output_wav="hello.wav", verbose=True)
```

Generation runs until the runtime config's `max_context_steps` (1500 steps, about 2 minutes) or until EOS is detected. `GenerationResult` includes the audio tokens, the waveform tensor, and word timestamps relative to Mimi's ~12.5 Hz frame rate.

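Since the word timestamps are expressed in Mimi frames, converting them to seconds is a single division by the frame rate. A minimal sketch — the constant and helper below are illustrative names, not part of the `dia2` API:

```python
# Mimi produces roughly 12.5 frames per second, so a frame index divided by
# the frame rate gives a timestamp in seconds. Illustrative helper, not dia2 API.
MIMI_FRAME_RATE_HZ = 12.5

def frame_to_seconds(frame_index: int) -> float:
    """Convert a Mimi frame index to a timestamp in seconds."""
    return frame_index / MIMI_FRAME_RATE_HZ

# The 2-minute cap matches max_context_steps: 1500 frames / 12.5 Hz = 120 s.
print(frame_to_seconds(1500))  # 120.0
```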
## Hugging Face

| Variant | Repo |
| --- | --- |
| Dia2-1B | [`nari-labs/Dia2-1B`](https://huggingface.co/nari-labs/Dia2-1B) |
| Dia2-2B | [`nari-labs/Dia2-2B`](https://huggingface.co/nari-labs/Dia2-2B) |

## License & Attribution

Licensed under [Apache 2.0](LICENSE). All third-party assets (the Kyutai Mimi codec, etc.) retain their original licenses.

## Disclaimer

This project offers a high-fidelity speech generation model intended for research and educational use. The following uses are **strictly forbidden**:

- **Identity misuse**: Do not produce audio resembling real individuals without permission.
- **Deceptive content**: Do not use this model to generate misleading content (e.g. fake news).
- **Illegal or malicious use**: Do not use this model for activities that are illegal or intended to cause harm.

By using this model, you agree to uphold relevant legal standards and ethical responsibilities. We **are not responsible** for any misuse and firmly oppose any unethical use of this technology.

## Acknowledgements

- We thank the [TPU Research Cloud](https://sites.research.google/trc/about/) program for providing compute for training.
- Our work was heavily inspired by [KyutaiTTS](https://kyutai.org/next/tts) and [Sesame](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice).

---
title: Dia2 2B
emoji: 💨
colorFrom: yellow
colorTo: gray
sdk: gradio
sdk_version: 4.41.0
app_file: app.py
pinned: false
short_description: Streaming conversational audio in realtime
disable_embedding: true
---

Check out the configuration reference at <https://huggingface.co/docs/hub/spaces-config-reference>