---
language:
- zh
- en
tags:
- llm
- tts
- zero-shot
- voice-cloning
- reinforcement-learning
- flow-matching
license: mit
pipeline_tag: text-to-speech
---
# GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS
<div align="center">
<a href="README.md">
<img src="https://img.shields.io/badge/Language_/_语言-English-blue?style=flat-square" alt="English">
</a>
<a href="README_zh.md">
<img src="https://img.shields.io/badge/Language_/_语言-中文-red?style=flat-square" alt="Chinese">
</a>
</div>
<br><br>
<div align="center">
<img src="assets/images/logo.svg" width="50%"/>
</div>
<p align="center">
<a href="https://github.com/zai-org/GLM-TTS" target="_blank">💻 GitHub Repository</a>
|
<a href="https://huggingface.co/spaces/zai-org/GLM-TTS" target="_blank">🤗 Online Demo</a>
|
<a href="https://audio.z.ai/" target="_blank">🛠️ Audio.Z.AI</a>
</p>
## Model Introduction
GLM-TTS is a text-to-speech (TTS) synthesis system built on large language models that supports zero-shot voice cloning and streaming inference. It uses a two-stage architecture: an LLM generates speech tokens, and a Flow Matching model converts them into mel-spectrograms that a vocoder renders as waveforms.
A **Multi-Reward Reinforcement Learning** framework further tunes the LLM for expressiveness, giving more natural emotional control than traditional TTS systems.
### Key Features
* **Zero-shot Voice Cloning:** Clone any speaker's voice with just 3-10 seconds of prompt audio.
* **RL-enhanced Emotion Control:** Utilizes a multi-reward reinforcement learning framework (GRPO) to optimize prosody and emotion.
* **High-quality Synthesis:** Reaches a Character Error Rate (CER) of 0.89 on `seed-tts-eval`, lower than the closed-source Seed-TTS (1.12).
* **Phoneme-level Control:** Supports "Hybrid Phoneme + Text" input for precise pronunciation control (e.g., polyphones).
* **Streaming Inference:** Supports real-time audio generation suitable for interactive applications.
* **Bilingual Support:** Optimized for mixed Chinese-English text.
## System Architecture
GLM-TTS follows a two-stage design:
1. **Stage 1 (LLM):** A Llama-based model converts input text into speech token sequences.
2. **Stage 2 (Flow Matching):** A Flow model converts token sequences into high-quality mel-spectrograms, which are then turned into waveforms by a vocoder.
<div align="center">
<img src="assets/images/architecture.png" width="60%" alt="GLM-TTS Architecture">
</div>
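The data flow of the two stages can be sketched as three composable callables. This is a minimal illustration, not the GLM-TTS code: `TwoStageTTS` and the `toy_*` functions are hypothetical stand-ins for the LLM, the Flow model, and the vocoder, and the toy outputs carry no acoustic meaning.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type aliases for illustration only.
TokenSeq = List[int]
Mel = List[List[float]]   # frames x mel bins
Waveform = List[float]


@dataclass
class TwoStageTTS:
    """Chains the stages: text -> speech tokens -> mel-spectrogram -> waveform."""
    text_to_tokens: Callable[[str], TokenSeq]   # Stage 1: LLM
    tokens_to_mel: Callable[[TokenSeq], Mel]    # Stage 2: Flow Matching model
    mel_to_wave: Callable[[Mel], Waveform]      # vocoder

    def synthesize(self, text: str) -> Waveform:
        tokens = self.text_to_tokens(text)
        mel = self.tokens_to_mel(tokens)
        return self.mel_to_wave(mel)


# Toy stand-ins so the sketch runs end to end (assumptions, not real models).
def toy_llm(text: str) -> TokenSeq:
    return [ord(ch) % 256 for ch in text]              # one token per character


def toy_flow(tokens: TokenSeq) -> Mel:
    return [[tok / 255.0] * 4 for tok in tokens]       # one 4-bin frame per token


def toy_vocoder(mel: Mel) -> Waveform:
    # Emit two samples per frame, each the mean of the frame's bins.
    return [sum(frame) / len(frame) for frame in mel for _ in range(2)]


tts = TwoStageTTS(toy_llm, toy_flow, toy_vocoder)
wave = tts.synthesize("hi")
```

Keeping each stage behind its own callable mirrors the description above: the token generator can be retrained or swapped without touching the acoustic model or the vocoder.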
### Reinforcement Learning Alignment
To tackle flat emotional expression, GLM-TTS uses a **Group Relative Policy Optimization (GRPO)** algorithm with multiple reward functions (Similarity, CER, Emotion, Laughter) to align the LLM's generation strategy.
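The core of GRPO is scoring each sampled candidate relative to the other candidates drawn for the same prompt. The sketch below shows that step under stated assumptions: the rewards are combined by a weighted sum, every reward is already scaled so that higher is better, and the group is normalized with the population standard deviation. The reward values, the weights, and the function names are hypothetical; the source does not specify how GLM-TTS combines its rewards.

```python
from statistics import mean, pstdev
from typing import Dict, List


def combine_rewards(rewards: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-candidate rewards (the combination rule is an assumption)."""
    return sum(weights[name] * value for name, value in rewards.items())


def group_relative_advantages(scores: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each score against the mean and spread of its own group."""
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(score - mu) / (sigma + eps) for score in scores]


# Hypothetical rewards for three candidates sampled from one prompt.
candidates = [
    {"similarity": 0.80, "cer": 0.95, "emotion": 0.40, "laughter": 0.0},
    {"similarity": 0.78, "cer": 0.99, "emotion": 0.70, "laughter": 1.0},
    {"similarity": 0.82, "cer": 0.60, "emotion": 0.50, "laughter": 0.0},
]
weights = {"similarity": 1.0, "cer": 1.0, "emotion": 1.0, "laughter": 1.0}

scores = [combine_rewards(c, weights) for c in candidates]
advantages = group_relative_advantages(scores)
```

Candidates that score above the group mean receive a positive advantage and are reinforced; those below the mean are pushed down. No separate value network is needed because the group itself serves as the baseline.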
## Evaluation Results
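Results on `seed-tts-eval`: **GLM-TTS_RL** achieves the lowest Character Error Rate (CER) at 0.89, a relative reduction of about 14% over the base model (1.03). Its speaker similarity (76.4) is in line with the other open-source models but below Seed-TTS (79.6).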
| Model | CER ↓ | SIM ↑ | Open-source |
| :--- | :---: | :---: | :---: |
| Seed-TTS | 1.12 | **79.6** | No |
| CosyVoice2 | 1.38 | 75.7 | Yes |
| F5-TTS | 1.53 | 76.0 | Yes |
| **GLM-TTS (Base)** | 1.03 | 76.1 | Yes |
| **GLM-TTS_RL (Ours)** | **0.89** | 76.4 | Yes |
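For readers unfamiliar with the two metrics, the sketch below shows how they are commonly computed. It is not the `seed-tts-eval` code: CER is taken as the character-level edit distance between a reference text and an ASR transcript of the synthesized audio, divided by the reference length, and SIM is assumed to be the cosine similarity between two speaker embeddings. The function names and the example strings are illustrative.

```python
import math
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate as a fraction; multiply by 100 for a percentage."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# One substituted character out of six.
example_cer = cer("今天天气很好", "今天天汽很好")
```

Because CER is normalized by the reference length, a single wrong character weighs more in a short utterance than in a long one, so scores are best compared on the same test set.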
## Quick Start
### Installation
```bash
git clone https://github.com/zai-org/GLM-TTS.git
cd GLM-TTS
pip install -r requirements.txt
```
#### Command Line Inference
```bash
python glmtts_inference.py \
    --data=example_zh \
    --exp_name=_test \
    --use_cache
# Append --phoneme to the command above to enable phoneme capabilities.
```
#### Shell Script Inference
```bash
bash glmtts_inference.sh
```
## Acknowledgments & Citation
We thank the following open-source projects for their support:
- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) - Provides the frontend processing framework and a high-quality vocoder
- [Llama](https://github.com/meta-llama/llama) - Provides the base language model architecture
- [Vocos](https://github.com/charactr-platform/vocos) - Provides a high-quality vocoder
- [GRPO-Zero](https://github.com/policy-gradient/GRPO-Zero) - Inspired the reinforcement learning algorithm implementation
If you use GLM-TTS in your research, please cite:
```bibtex
@misc{glmtts2025,
title={GLM-TTS: Controllable \& Emotion-Expressive Zero-shot TTS with Multi-Reward Reinforcement Learning},
author={CogAudio Group Members},
year={2025},
publisher={Zhipu AI Inc}
}
```