Based on the VITS paper: *Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech* (arXiv:2106.06103).
A VITS text-to-speech model for Sinhala (සිංහල), trained using Coqui TTS.
GitHub: pradeep-sanjaya/sinhala-tts
| Detail | Value |
|---|---|
| Model | VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) |
| Language | Sinhala (සිංහල) |
| Epochs | 300 |
| Final mel loss | ~18.92 |
| Dataset | Multi-speaker TTS Dataset Sinhala |
| GPU | NVIDIA A100-80GB (via Modal) |
| Training time | ~3.2 hours |
| Framework | Coqui TTS 0.27.5 |
```python
import numpy as np
import soundfile as sf
from huggingface_hub import hf_hub_download
from TTS.utils.synthesizer import Synthesizer

# Download the model config and checkpoint from the Hub
config_path = hf_hub_download(repo_id="ngpsanjaya/vits-sinhala", filename="config.json")
model_path = hf_hub_download(repo_id="ngpsanjaya/vits-sinhala", filename="model.pth")

synthesizer = Synthesizer(
    tts_checkpoint=model_path,
    tts_config_path=config_path,
    use_cuda=True,  # set to False to run on CPU
)

# Synthesize speech ("ආයුබෝවන්" = "Hello"); returns a list of float samples
wav = synthesizer.tts("ආයුබෝවන්")

# Save to a WAV file at the sample rate defined in the model config
sf.write("output.wav", np.array(wav), synthesizer.tts_config.audio.sample_rate)
```
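If `soundfile` is not available, the float waveform (values in [-1, 1]) can also be written as 16-bit PCM using only the standard-library `wave` module. A minimal sketch, assuming a 22050 Hz sample rate (in practice, read the actual rate from `synthesizer.tts_config.audio.sample_rate`); the helper name and the synthetic test tone are illustrative, not part of this repo:

```python
import wave
import numpy as np

def write_wav_pcm16(path, samples, sample_rate=22050):
    """Write a mono float waveform (values in [-1, 1]) as a 16-bit PCM WAV file."""
    # Scale floats to the int16 range, clipping to avoid overflow
    pcm = (np.clip(np.asarray(samples), -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)          # mono
        f.setsampwidth(2)          # 16-bit samples
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())

# Illustrative usage with a synthetic 440 Hz tone standing in for model output:
t = np.linspace(0, 1, 22050, endpoint=False)
write_wav_pcm16("tone.wav", 0.5 * np.sin(2 * np.pi * 440 * t))
```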
The full training pipeline supports Modal, Kaggle, Google Colab, and AWS SageMaker; see the GitHub repo for the training scripts and setup details.
License: MIT. Please check the dataset's license for data usage terms.