LLaSA Goes RL: Training LLaSA with GRPO for Improved Prosody and Expressiveness
Over the past year, LLaSA has emerged as one of the most practical frameworks for LLM-based speech synthesis: a single autoregressive Transformer that generates speech tokens using the same paradigm as text LLMs. Building on this architecture, we recently explored a new direction: training Llasa with reinforcement learning, using GRPO (Group Relative Policy Optimization).
This post walks through the ideas, setup, and early results of fine-tuning Llasa with GRPO, with code available here: https://github.com/Deep-unlearning/Llasa-GRPO
The setup is heavily inspired by *Group Relative Policy Optimization for Text-to-Speech with Large Language Models*, especially its reward modeling.
The focus is on leveraging learned reward models for prosody, with the goal of producing more expressive, natural, and context-aware speech.
1. Motivation: Why Reinforcement Learning for TTS?
Most LLM-based TTS models are trained with standard maximum likelihood estimation (MLE): the model is rewarded for reproducing the reference token sequence, which pushes it toward an averaged, play-it-safe rendition. But human speech isn't average:
- Some phrases need emphasis
- Sentences follow a melody (intonation)
- Speakers express emotion and rhythm
MLE tends to encourage safe and flat prosody.
This is where Reinforcement Learning enters. Instead of training the model only to copy reference speech, we train it to optimize for qualities we actually care about: clarity, expressiveness, rhythm, speaker identity consistency, etc.
GRPO is particularly suited to this because:
- It operates on discrete tokens, which is exactly what Llasa outputs
- It needs only a policy model plus a reward signal, with no separate value/critic network
- It does not require the reward to be differentiable
2. GRPO in a Nutshell
GRPO (Group Relative Policy Optimization) is a reinforcement learning algorithm designed for large autoregressive sequence models. In short:
- Sample a group of candidate outputs from the model (the policy) for each prompt.
- Score each candidate with a reward function or reward model.
- Update the model parameters to increase the probability of sequences that score above their group's average reward and decrease the probability of those that score below it.
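The "group relative" part refers to how the advantage is computed: instead of training a separate value network, GRPO samples a group of $G$ outputs per prompt and normalizes each sample's reward against its own group's statistics:

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}
$$

These advantages then enter a clipped, PPO-style policy-gradient objective, usually with a KL penalty toward the initial (reference) model to keep the policy from drifting too far from sensible speech.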
3. Architecture Recap: Llasa + XCodec2
We still rely on the Llasa pipeline:
| Component | Role |
|---|---|
| LLaSA Transformer | Autoregressive model generating speech tokens |
| XCodec2 | Tokenizer converting waveforms into discrete speech tokens |
Because the speech is already represented as discrete tokens, the RL loop is efficient and avoids frame-level DSP complications.
4. The GRPO Training Pipeline
The repository contains the full GRPO fine-tuning pipeline:

```
Llasa-GRPO/
│
├─ create_dataset.py   # Prepare datasets (tokenize audio, build prompts)
├─ train.py            # Main GRPO training loop using the policy model
├─ inference.py        # Inference: generate speech from text with the trained model
├─ reward_whisper.py   # Reward computation using ASR + WER via Whisper
├─ requirements.txt    # Dependency list
└─ README.md           # Documentation & usage instructions
```
4.1 Reward Model: Measuring Speech Quality
Our current experiments use a composite reward that combines a word error rate (WER) term and a negative log-likelihood (NLL) term, both computed with an ASR model (Whisper):

$$
R = \lambda_{\text{WER}} \cdot e^{-\alpha \cdot \text{WER}} + \lambda_{\text{NLL}} \cdot e^{-\beta \cdot \text{NLL}}
$$

where:
- $\alpha$ and $\beta$ control the sensitivity of the reward function to WER and NLL
- $\lambda_{\text{WER}}$ and $\lambda_{\text{NLL}}$ denote the weights assigned to the WER and NLL rewards

Intuitively, the WER term rewards speech that is transcribed back to the target text accurately (intelligibility), while the NLL term rewards speech that the recognizer is confident about.
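To make this concrete, here is a minimal sketch of the WER part of the reward, in the spirit of reward_whisper.py. It assumes the generated speech tokens have already been decoded back to 16 kHz waveforms; the function name, the alpha value, and the use of the transformers ASR pipeline plus jiwer are illustrative choices, not necessarily how the repository implements it.

```python
# Minimal sketch of a WER-based reward (in the spirit of reward_whisper.py).
# Assumes `waveforms` are 16 kHz mono numpy arrays decoded from generated speech tokens.
import math

import jiwer
from transformers import pipeline

# ASR model used to transcribe the generated speech (the post uses Whisper-Large v3).
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

def wer_reward(waveforms, target_texts, alpha=5.0):
    """Map each sample's WER against its target text to a reward in (0, 1]."""
    rewards = []
    for wav, text in zip(waveforms, target_texts):
        transcript = asr({"raw": wav, "sampling_rate": 16000})["text"]
        wer = jiwer.wer(text.lower(), transcript.lower())
        rewards.append(math.exp(-alpha * wer))  # lower WER -> higher reward; alpha sets sensitivity
    return rewards
```

An NLL-based term can be added in the same spirit by scoring the target text's likelihood under the ASR model and combining the two terms with the weights $\lambda$ above.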
4.2 Dataset Preparation and Audio Tokenization
Llasa operates on discrete speech tokens, so the audio must be tokenized with XCodec2 before training.
The script create_dataset.py handles this end-to-end:
```bash
python create_dataset.py \
    --dataset your_dataset_name_or_path \
    --output_dir tokenized_dataset \
    --codec_id HKUSTAudio/xcodec2 \
    --sampling_rate 16000
```
This script performs the following steps:
| Step | Description |
|---|---|
| 1. Audio Loading | Reads raw waveform audio files from the dataset |
| 2. Tokenization | Converts audio into XCodec2 speech tokens |
| 3. Text Alignment | Builds paired text → speech token training sequences |
| 4. Formatting | Saves the processed dataset in a format ready for training or uploading to Hugging Face |
If you already have a dataset that includes XCodec2 speech tokens, you can skip this step and pass it directly to train.py via --dataset.
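If you want to see what the tokenization step looks like in isolation, here is a minimal sketch using the xcodec2 package's XCodec2Model, following its model card; create_dataset.py may wrap this differently, and the audio file name is just an example.

```python
# Minimal sketch: encode a waveform into discrete XCodec2 speech tokens (and back).
# Follows the XCodec2 model card; create_dataset.py may wrap this differently.
import torch
import torchaudio
from xcodec2.modeling_xcodec2 import XCodec2Model

device = "cuda" if torch.cuda.is_available() else "cpu"
codec = XCodec2Model.from_pretrained("HKUSTAudio/xcodec2").eval().to(device)

# Load an example file (name is illustrative) and resample to 16 kHz mono.
wav, sr = torchaudio.load("sample.wav")
wav = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0, keepdim=True)

with torch.no_grad():
    codes = codec.encode_code(input_waveform=wav.to(device))  # discrete speech token ids
    recon = codec.decode_code(codes)                          # optional round-trip back to audio

print(codes.shape)  # one stream of token ids per utterance, ready to pair with its text
```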
4.3 Running Training
```bash
accelerate launch train.py \
    --model_name_or_path HKUSTAudio/Llasa-1B-Multilingual \
    --dataset your_dataset.json \
    --reward_config reward_models/prosody_reward.json \
    --output_dir llasa-grpo-exp1
```
Hardware:
- 1×A100 (40GB) for Llasa-1B
- 2–4 GPUs recommended for stable RL training
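To give a feel for how such a loop can be wired up, here is a hedged sketch built on Hugging Face TRL's GRPOTrainer. This is not necessarily how train.py is implemented: the placeholder reward below stands in for the real WER/NLL reward (which would decode the generated speech tokens with XCodec2 and score the audio with Whisper, as sketched earlier), and the dataset is assumed to contain a "prompt" column.

```python
# Hedged sketch of a GRPO loop using TRL's GRPOTrainer; train.py may differ in detail.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def speech_reward(completions, **kwargs):
    # Placeholder reward: the real setup would decode each completion's speech tokens
    # with XCodec2 and score the audio with the Whisper WER/NLL reward (reward_whisper.py).
    return [min(len(c) / 1024, 1.0) for c in completions]

# Assumed: a tokenized dataset with a "prompt" column (text plus start-of-speech tokens).
train_dataset = load_dataset("json", data_files="your_dataset.json", split="train")

config = GRPOConfig(
    output_dir="llasa-grpo-exp1",
    num_generations=8,           # group size G: completions sampled per prompt
    max_completion_length=1024,  # speech token sequences are long; budget accordingly
)

trainer = GRPOTrainer(
    model="HKUSTAudio/Llasa-1B-Multilingual",
    reward_funcs=speech_reward,
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
```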
5. Running Results
Below is the reward curve recorded during GRPO training. We trained Llasa-1B using Whisper-Large v3 to compute the WER-based and NLL-based reward signals during optimization.
Sample Before & After
| Model | Sample Link |
|---|---|
| Llasa baseline | |
| Llasa + GRPO | |
6. Results: What Changes After GRPO Training?
✅ Improvements observed:
- The proposed GRPO approach substantially improves both the semantic consistency (i.e., alignment of generated speech with text) and the naturalness of synthesized speech.
- The combined reward function (WER + NLL) offers better training stability and higher-quality outputs than either component alone.
- In zero-shot TTS evaluation across multiple languages, fine-tuning with GRPO reduced character/word error rates (CER/WER) and improved mean opinion score (MOS) for naturalness.
⚠️ Things to monitor:
- While intelligibility and semantic consistency improved, speaker similarity did not uniformly increase; some cases showed comparable or only marginal gains in speaker style retention.
- As the authors of the original GRPO-for-TTS paper also note, even with the improved reward scheme, optimizing only with ASR-based metrics may not fully capture all perceptual aspects of speech (e.g., emotion, expressiveness, prosodic nuance).
7. What’s Next?
Upcoming work includes:
| Direction | Goal |
|---|---|
| Learned prosody reward model | Replace handcrafted metrics with neural reward |
| RL from human or AI feedback (RLHF / RLAIF) | Evaluate perceived emotional quality directly |
| Speaker adaptation GRPO | Optimize stylistic variation per speaker identity |
The ultimate target: controllable, emotionally expressive multilingual speech.
Closing
This experiment suggests that speech synthesis is entering the same RL-driven era as text LLMs. By shaping how the model speaks, rather than just what it says, we open the door to voice assistants, audiobook narrators, and virtual characters that actually sound alive.
Repository: https://github.com/Deep-unlearning/Llasa-GRPO
If you test or build on this, I would be glad to hear your results!
