---
license: apache-2.0
datasets:
- amphion/Emilia-Dataset
language:
- en
- zh
tags:
- speech-synthesis
- pytorch

author:
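  - name: Kangxiang Xia  # Author or team name (e.g. your lab: the Audio, Speech and Language Processing Group at Northwestern Polytechnical University)
    email: xkx@mail.nwpu.edu.cn  # Optional, for collaboration contact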
  
organization:
  - name: ASLP@NPU  # Affiliated organization
    url: http://www.npu-aslp.org/  # Organization website

links:
  - name: Paper  # Paper link (arXiv or journal URL, if any)
    url: https://arxiv.org/abs/2412.16846 # Example: arXiv link
  - name: GitHub Repo  # Link to an additional code repository, if any
    url: https://github.com/xkx-hub/KALL-E
  - name: Demo Page  # Online demo page
    url: https://nwpu-aslp.feishu.cn/wiki/TfLEwoITwiTReakgfnPczGfunzh

---

# 🎙️ KALL-E: Autoregressive Speech Synthesis with Next-Distribution Prediction

[![Project Page](https://img.shields.io/badge/Project%20Page-GitHub-blue)](https://github.com/xkx-hub/KALL-E)  [![arXiv](https://img.shields.io/badge/arXiv-2412.16846-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2412.16846) [![Demo](https://img.shields.io/badge/Demo%20Page-blue)](https://nwpu-aslp.feishu.cn/wiki/TfLEwoITwiTReakgfnPczGfunzh?from=from_copylink)

## News

* [2025.08.05] 🔥 🔥 🔥  We release the inference code of [KALL-E](https://github.com/xkx-hub/KALL-E)!
* [2025.09.17] 🎉 🎉 🎉  The KALL-E paper has been updated on [arXiv](https://arxiv.org/abs/2412.16846); read it now!


## Overview
This repository contains the inference utilities for **KALL-E**, a text-to-speech system that predicts continuous speech representations using a single autoregressive language model.

![System Overview](./figures/kalle-architecture.jpg)

- **Autoregressive Language Modeling**: Utilizes an autoregressive approach for next-distribution prediction in text-to-speech synthesis.
- **Continuous Speech Distribution**: Directly models and predicts continuous speech distributions conditioned on text, avoiding reliance on diffusion-based components.
- **FlowVAE**: Employs FlowVAE to extract continuous speech distributions from waveforms, rather than using discrete speech tokens.
- **Single AR Language Model**: Uses a single autoregressive language model to predict continuous speech distributions from text, constrained by Kullback-Leibler divergence loss.
- **Simplified Paradigm**: Offers a more straightforward and effective approach for using continuous speech representations in TTS.
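
The KL-divergence constraint mentioned above can be illustrated with a small sketch. It assumes, purely for illustration, that both the predicted and the target speech distributions are diagonal Gaussians described by a mean and a log-variance per dimension. The function name `gaussian_kl` and this parameterisation are assumptions and are not taken from the KALL-E code; which distribution plays the role of `p` or `q` during training is not stated here.

```python
import math


def gaussian_kl(mu_p, logvar_p, mu_q, logvar_q):
    """KL(p || q) for two diagonal Gaussians, summed over dimensions.

    Each argument is a sequence of floats of the same length
    (hypothetical parameterisation: mean and log-variance per dimension).
    """
    total = 0.0
    for mp, lp, mq, lq in zip(mu_p, logvar_p, mu_q, logvar_q):
        # Closed form per dimension:
        # 0.5 * (log(var_q / var_p) + (var_p + (mu_p - mu_q)^2) / var_q - 1)
        total += 0.5 * (lq - lp + (math.exp(lp) + (mp - mq) ** 2) / math.exp(lq) - 1.0)
    return total


# Two unit-variance Gaussians whose means differ by 1 in the second dimension.
print(gaussian_kl([0.0, 1.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]))  # 0.5
```

The divergence is zero when the two distributions match and grows as the predicted mean or variance drifts away from the target, which is what makes it usable as a training constraint.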

## Key Features

- **Random Speaker Voices**
    When no speaker prompt is provided, the model generates a random voice, either female or male.

- **⚡ Blazing-fast Synthesis**
    Generate up to 5 seconds of audio with a single click in the web UI.

- **Context-aware Synthesis**
    KALL-E generates expressive, context-aware speech and handles complex linguistic and emotional features.

## Environment Setup

- Python >= 3.9
- PyTorch with CUDA support
- Transformers==4.49.0
- NumPy
- SciPy
- alias-free-torch

Then clone the code from GitHub:
```bash
git clone https://github.com/xkx-hub/KALL-E.git
cd KALL-E
```
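
Before running inference, a quick sanity check of the requirements listed above may save some debugging time. The sketch below is not part of the repository. The import names (in particular `alias_free_torch` for the `alias-free-torch` package) are assumptions, and the helper names `missing_modules` and `python_is_supported` are hypothetical. It does not verify CUDA availability or the pinned Transformers version.

```python
import importlib.util
import sys

# Import names assumed for the packages in the requirements list.
REQUIRED_MODULES = ["torch", "transformers", "numpy", "scipy", "alias_free_torch"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_supported(version_info=sys.version_info):
    """Check the interpreter against the Python >= 3.9 requirement."""
    return tuple(version_info[:2]) >= (3, 9)


print("Python OK:", python_is_supported())
print("Missing modules:", missing_modules(REQUIRED_MODULES))
```

An empty "Missing modules" list only means the packages are discoverable; it does not guarantee that the installed versions match.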

## Usage

### 1. Model Download

Download the model checkpoints in advance and place them as follows:


```text
KALL-E
|    ckpt
|    | - flowvae.pt
|    | - model.pt
|    ......
|    model.py
|    infer.py
```
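
To confirm that the checkpoints are where the layout above expects them, a small check along these lines could be run from the repository root. It is a sketch rather than part of the repository: the helper name `missing_checkpoints` is hypothetical, and it only looks for the two files named in the tree.

```python
from pathlib import Path

# File names taken from the directory layout shown above.
EXPECTED_CHECKPOINTS = ["flowvae.pt", "model.pt"]


def missing_checkpoints(repo_root="."):
    """Return the expected checkpoint files that are absent from <repo_root>/ckpt."""
    ckpt_dir = Path(repo_root) / "ckpt"
    return [name for name in EXPECTED_CHECKPOINTS if not (ckpt_dir / name).is_file()]


print("Missing checkpoints:", missing_checkpoints("."))
```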


### 2. Unconditional generation

```bash
python infer.py --target_text "<ka li E> is a text-to-speech system that predicts continuous speech representations using a single autoregressive language model."
```

### 3. Conditional generation 

```bash
python infer.py \
  --target_text "<ka li E> is a text-to-speech system that predicts continuous speech representations using a single autoregressive language model." \
  --prompt_text "oh that's crazy!" \
  --prompt_wav_path ./test.wav
```
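
For synthesizing several sentences in a row, the two commands above can be wrapped in a small Python helper. This is a sketch under stated assumptions: it uses only the `--target_text`, `--prompt_text`, and `--prompt_wav_path` flags shown above, the helper names `build_infer_command` and `synthesize` are hypothetical, and the rule that the prompt text and prompt audio are supplied together is an assumption rather than documented behaviour of `infer.py`.

```python
import subprocess
import sys


def build_infer_command(target_text, prompt_text=None, prompt_wav_path=None):
    """Build the argument list for infer.py using the flags shown above."""
    # Assumption: a speaker prompt consists of both a transcript and a wav file.
    if (prompt_text is None) != (prompt_wav_path is None):
        raise ValueError("prompt_text and prompt_wav_path should be given together")
    cmd = [sys.executable, "infer.py", "--target_text", target_text]
    if prompt_text is not None:
        cmd += ["--prompt_text", prompt_text, "--prompt_wav_path", str(prompt_wav_path)]
    return cmd


def synthesize(target_text, prompt_text=None, prompt_wav_path=None):
    """Run infer.py from the repository root; raises if the script exits with an error."""
    subprocess.run(build_infer_command(target_text, prompt_text, prompt_wav_path), check=True)
```

Calling `synthesize("Hello there.")` corresponds to unconditional generation, while `synthesize("Hello there.", "oh that's crazy!", "./test.wav")` corresponds to the conditional command. Passing the arguments as a list avoids shell quoting issues with text that contains quotes or angle brackets.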

### 4. Web demo

```bash
python web.py
```