---
language:
- zh
- en
tags:
- llm
- tts
- zero-shot
- voice-cloning
- reinforcement-learning
- flow-matching
license: mit
pipeline_tag: text-to-speech
---

# GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS

<div align="center">
<a href="README.md">
    <img src="https://img.shields.io/badge/Language_/_语言-English-blue?style=flat-square" alt="English">
</a>
<a href="README_zh.md">
    <img src="https://img.shields.io/badge/Language_/_语言-δΈ­ζ–‡-red?style=flat-square" alt="Chinese">
</a>
</div>

<br><br>

<div align="center">
<img src="assets/images/logo.svg" width="50%"/>
</div>

<p align="center">
    <a href="https://github.com/zai-org/GLM-TTS" target="_blank">πŸ’» GitHub Repository</a>
    &nbsp;&nbsp;|&nbsp;&nbsp;
    <a href="https://huggingface.co/spaces/zai-org/GLM-TTS" target="_blank">πŸ€— Online Demo</a>
    &nbsp;&nbsp;|&nbsp;&nbsp;
    <a href="https://audio.z.ai/" target="_blank">πŸ› οΈ Audio.Z.AI</a>
</p>

## πŸ“– Model Introduction

GLM-TTS is a high-quality text-to-speech (TTS) synthesis system based on large language models, supporting zero-shot voice cloning and streaming inference. The system adopts a two-stage architecture combining an LLM for speech token generation and a Flow Matching model for waveform synthesis.

By introducing a **Multi-Reward Reinforcement Learning** framework, GLM-TTS significantly improves the expressiveness of generated speech, achieving more natural emotional control compared to traditional TTS systems.

### Key Features

* **Zero-shot Voice Cloning:** Clone any speaker's voice with just 3-10 seconds of prompt audio.
* **RL-enhanced Emotion Control:** Utilizes a multi-reward reinforcement learning framework (GRPO) to optimize prosody and emotion.
* **High-quality Synthesis:** Generates speech comparable to commercial systems, with a lower Character Error Rate (CER; see Evaluation Results below).
* **Phoneme-level Control:** Supports "Hybrid Phoneme + Text" input for precise pronunciation control (e.g., polyphones); see the sketch after this list.
* **Streaming Inference:** Supports real-time audio generation suitable for interactive applications.
* **Bilingual Support:** Optimized for Chinese and English mixed text.
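
To make the hybrid input idea concrete, here is a minimal sketch of pinning a polyphone's reading. The inline `{pinyin}` tags are a hypothetical notation invented for this illustration; GLM-TTS's frontend defines its own markup, enabled via the `--phoneme` flag shown in Quick Start.

```python
# Hypothetical hybrid "phoneme + text" input. The inline {pinyin} tags below are
# an assumed notation for this sketch; GLM-TTS's frontend defines the real markup.
plain  = "他长得很高"          # "长" is a polyphone: chang2 ("long") vs zhang3 ("to grow")
hybrid = "他{zhang3}得很高"    # pin the intended reading explicitly
```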

## System Architecture

GLM-TTS follows a two-stage design:

1.  **Stage 1 (LLM):** A Llama-based language model converts the input text into a sequence of discrete speech tokens.
2.  **Stage 2 (Flow Matching):** A Flow Matching model converts the token sequence into a high-quality mel-spectrogram, which a vocoder then renders as a waveform.

<div align="center">
  <img src="assets/images/architecture.png" width="60%" alt="GLM-TTS Architecture">
</div>
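
At inference time the two stages compose into a simple pipeline. The sketch below illustrates the data flow only; the method names (`llm.generate`, `flow.decode`) are assumptions made for readability, not the repository's actual API.

```python
def synthesize(text, prompt_audio, llm, flow, vocoder):
    """Minimal sketch of the two-stage GLM-TTS data flow (illustrative API)."""
    # Stage 1: the Llama-based LLM emits discrete speech tokens, conditioned on
    # the target text and the speaker identity carried by the prompt audio.
    speech_tokens = llm.generate(text=text, prompt=prompt_audio)
    # Stage 2: the Flow Matching model maps the tokens to a mel-spectrogram,
    # which the vocoder (e.g. Vocos) renders as the final waveform.
    mel = flow.decode(speech_tokens)
    return vocoder(mel)
```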

### Reinforcement Learning Alignment
To address flat emotional expression, GLM-TTS uses the **Group Relative Policy Optimization (GRPO)** algorithm with multiple reward functions (Similarity, CER, Emotion, Laughter) to align the LLM's generation policy.
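
A core idea in GRPO is to score each sampled utterance against its own sampling group rather than a learned value baseline. The sketch below shows a group-relative advantage computation over a weighted sum of the four rewards; the weights and scores are made up for illustration, not GLM-TTS's tuned configuration.

```python
import numpy as np

# Illustrative reward weights; GLM-TTS's actual weighting is not specified here.
# All scores are treated as rewards (higher is better); a CER-based reward
# would be something like 1 - CER, since CER itself is an error rate.
WEIGHTS = {"sim": 1.0, "cer": 1.0, "emotion": 1.0, "laughter": 1.0}

def grpo_advantages(group_rewards):
    """Normalize each sample's total reward against its group's mean and std."""
    totals = np.array([
        sum(WEIGHTS[k] * r[k] for k in WEIGHTS) for r in group_rewards
    ])
    return (totals - totals.mean()) / (totals.std() + 1e-8)

# Four candidate utterances sampled for the same text prompt (made-up scores).
group = [
    {"sim": 0.82, "cer": 0.90, "emotion": 0.7, "laughter": 0.0},
    {"sim": 0.80, "cer": 0.70, "emotion": 0.9, "laughter": 0.1},
    {"sim": 0.85, "cer": 0.95, "emotion": 0.6, "laughter": 0.0},
    {"sim": 0.78, "cer": 0.85, "emotion": 0.8, "laughter": 0.0},
]
print(grpo_advantages(group))  # positive => better than the group average
```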

## Evaluation Results

Evaluated on `seed-tts-eval`. **GLM-TTS_RL** achieves the lowest Character Error Rate (CER) while maintaining high speaker similarity.

| Model | CER ↓ | SIM ↑ | Open-source |
| :--- | :---: | :---: | :---: |
| Seed-TTS | 1.12 | **79.6** | πŸ”’ No |
| CosyVoice2 | 1.38 | 75.7 | πŸ‘ Yes |
| F5-TTS | 1.53 | 76.0 | πŸ‘ Yes |
| **GLM-TTS (Base)** | 1.03 | 76.1 | πŸ‘ Yes |
| **GLM-TTS_RL (Ours)** | **0.89** | 76.4 | πŸ‘ Yes |

## Quick Start

### Installation

```bash
git clone https://github.com/zai-org/GLM-TTS.git
cd GLM-TTS
pip install -r requirements.txt
```

### Command Line Inference

```bash
# Add --phoneme to enable phoneme-level pronunciation control.
python glmtts_inference.py \
    --data=example_zh \
    --exp_name=_test \
    --use_cache
```

### Shell Script Inference

```bash
bash glmtts_inference.sh
```

## Acknowledgments & Citation

We thank the following open-source projects for their support:

- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) - Frontend processing framework and high-quality vocoder
- [Llama](https://github.com/meta-llama/llama) - Base language model architecture
- [Vocos](https://github.com/charactr-platform/vocos) - High-quality vocoder
- [GRPO-Zero](https://github.com/policy-gradient/GRPO-Zero) - Inspiration for the reinforcement learning algorithm implementation

If you use GLM-TTS in your research, please cite:

```bibtex
@misc{glmtts2025,
  title={GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS with Multi-Reward Reinforcement Learning},
  author={CogAudio Group Members},
  year={2025},
  publisher={Zhipu AI Inc}
}