🎙️ KALL-E: Autoregressive Speech Synthesis with Next-Distribution Prediction

News

[2025.08.05] 🔥 🔥 🔥 We release the inference code of KALL-E!
[2025.09.17] 🎉 🎉 🎉 The KALL-E paper has been updated on arXiv. Read it now!

Overview

This repository contains the inference utilities for KALL-E, a text-to-speech system that predicts continuous speech representations using a single autoregressive language model.

Autoregressive Language Modeling: Utilizes an autoregressive approach for next-distribution prediction in text-to-speech synthesis.
Continuous Speech Distribution: Directly models and predicts continuous speech distributions conditioned on text, avoiding reliance on diffusion-based components.
FlowVAE: Employs FlowVAE to extract continuous speech distributions from waveforms, rather than using discrete speech tokens.
Single AR Language Model: Uses a single autoregressive language model to predict continuous speech distributions from text, constrained by Kullback-Leibler divergence loss.
Simplified Paradigm: Offers a more straightforward and effective approach for using continuous speech representations in TTS.
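
To make the Kullback-Leibler constraint mentioned above more concrete, here is a minimal sketch of a closed-form KL divergence between two distributions. It assumes each speech frame is described by a diagonal Gaussian (a mean and a log-variance per latent dimension); this is an illustrative assumption, and the distribution family and loss used by KALL-E may differ. The function name `gaussian_kl` is hypothetical and is not part of the repository.

```python
import math

def gaussian_kl(mu_p, logvar_p, mu_q, logvar_q):
    """KL(p || q) between two diagonal Gaussians, summed over dimensions.

    Assumption: p and q are diagonal Gaussians given by per-dimension
    means and log-variances. This is a sketch, not KALL-E's actual loss.
    """
    total = 0.0
    for mp, lp, mq, lq in zip(mu_p, logvar_p, mu_q, logvar_q):
        var_p, var_q = math.exp(lp), math.exp(lq)
        # 0.5 * (log(var_q / var_p) + (var_p + (mp - mq)^2) / var_q - 1)
        total += 0.5 * (lq - lp + (var_p + (mp - mq) ** 2) / var_q - 1.0)
    return total
```

The divergence is zero when the two distributions match and grows as the means or variances drift apart, which is the property a next-distribution prediction loss would rely on.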

Key Features

Random Speaker Voices: When no speaker prompt is provided, the model generates a random voice, either female or male.
⚡ Blazing-fast Synthesis: Generate up to 5 seconds of audio with a single click in the web UI.
Context-aware Synthesis: KALL-E generates expressive, context-aware speech and handles complex linguistic and emotional features.

Environment Setup

Python >= 3.9
PyTorch with CUDA support
Transformers==4.49.0
NumPy
SciPy
alias-free-torch

Then clone the code from GitHub:

git clone https://github.com/xkx-hub/KALL-E.git
cd KALL-E
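
The requirements listed in this section can be checked before running inference. The sketch below uses the standard-library `importlib.metadata` module to look up installed packages. The distribution names in `REQUIRED` are assumptions derived from the requirement list (for example, `torch` for PyTorch), and the helper `check_environment` is hypothetical, not part of the repository. It does not verify CUDA support.

```python
import sys
from importlib import metadata

# Distribution names are assumptions based on the requirement list;
# adjust them if your installation uses different package names.
REQUIRED = {
    "torch": None,
    "transformers": "4.49.0",  # pinned version from the requirement list
    "numpy": None,
    "scipy": None,
    "alias-free-torch": None,
}

def check_environment(required=REQUIRED, min_python=(3, 9)):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < min_python:
        problems.append("Python %d.%d or newer is required" % min_python)
    for name, pinned in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        if pinned is not None and installed != pinned:
            problems.append(f"{name}=={pinned} expected, found {installed}")
    return problems

if __name__ == "__main__":
    for line in check_environment() or ["Environment looks OK"]:
        print(line)
```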

Usage

1. Model Download

Download the model checkpoints in advance and place them as follows:

KALL-E
|    ckpt
|    | - flowvae.pt
|    | - model.pt
|    ......
|    model.py
|    infer.py
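
A quick way to confirm the layout above is to check for the named files before starting inference. This is a minimal sketch; the helper `missing_files` is hypothetical, and it only covers the four files named in the layout, not anything elided by the `......` line.

```python
from pathlib import Path

# File names taken from the layout above, relative to the repository root.
EXPECTED_FILES = ("ckpt/flowvae.pt", "ckpt/model.pt", "model.py", "infer.py")

def missing_files(repo_root="."):
    """Return the expected files that are absent under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]

if __name__ == "__main__":
    missing = missing_files()
    print("All files found" if not missing else f"Missing: {missing}")
```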

2. Unconditional generation

python infer.py --target_text "<ka li E> is a text-to-speech system that predicts continuous speech representations using a single autoregressive language model."

3. Conditional generation


python infer.py \
--target_text "<ka li E> is a text-to-speech system that predicts continuous speech representations using a single autoregressive language model." \
--prompt_text "oh that's crazy!" \
--prompt_wav_path ./test.wav
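
For scripted or batch use, the two commands above can be assembled from Python. The sketch below only uses the flags shown in this section (`--target_text`, `--prompt_text`, `--prompt_wav_path`). The helper names are hypothetical, and the rule that the prompt text and prompt audio must be supplied together is an assumption based on the example, not documented behavior of `infer.py`.

```python
import subprocess
import sys

def build_infer_command(target_text, prompt_text=None, prompt_wav_path=None):
    """Build the argument list for infer.py using the flags shown above."""
    # Assumption: a speaker prompt needs both its transcript and its audio.
    if (prompt_text is None) != (prompt_wav_path is None):
        raise ValueError("prompt_text and prompt_wav_path must be given together")
    cmd = [sys.executable, "infer.py", "--target_text", target_text]
    if prompt_text is not None:
        cmd += ["--prompt_text", prompt_text,
                "--prompt_wav_path", str(prompt_wav_path)]
    return cmd

def run_inference(*args, **kwargs):
    """Run infer.py from the repository root; raises if it exits with an error."""
    subprocess.run(build_infer_command(*args, **kwargs), check=True)
```

Calling `run_inference("Hello.")` corresponds to unconditional generation, while `run_inference("Hello.", "oh that's crazy!", "./test.wav")` corresponds to the conditional command. Passing arguments as a list avoids shell quoting problems with text that contains quotes or angle brackets.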

4. Web demo

python web.py
