Llasa Goes RL: Training LLaSA with GRPO for Improved Prosody and Expressiveness

Community Article Published November 5, 2025

Over the past year, LLaSA has emerged as one of the most practical frameworks for LLM-based speech synthesis: a single, autoregressive Transformer that generates speech tokens using the same paradigm as text LLMs. Building on this architecture, we recently explored a new direction: training Llasa with Reinforcement Learning, using GRPO (Group Relative Policy Optimization).

This post walks through the ideas, setup, and early results of fine-tuning Llasa with GRPO. The code is available here: https://github.com/Deep-unlearning/Llasa-GRPO

The setup is heavily inspired by "Group Relative Policy Optimization for Text-to-Speech with Large Language Models", especially its reward modeling.

The focus is on leveraging learned reward models for prosody, with the goal of producing more expressive, natural, and context-aware speech.


1. Motivation: Why Reinforcement Learning for TTS?

Most LLM-based TTS models are trained with standard maximum likelihood estimation (MLE). The model is rewarded for reproducing the average target token sequence. But human speech isn't average:

  • Some phrases need emphasis
  • Sentences follow a melody (intonation)
  • Speakers express emotion and rhythm

MLE tends to encourage safe and flat prosody.

This is where Reinforcement Learning enters. Instead of training the model only to copy reference speech, we train it to optimize for qualities we actually care about: clarity, expressiveness, rhythm, speaker identity consistency, etc.

GRPO is particularly suited to this because:

  • It operates on discrete tokens (like Llasa output)
  • It works with a policy model + reward model
  • It does not require differentiability of the reward

2. GRPO in a Nutshell

GRPO (Group Relative Policy Optimization) is a reinforcement learning algorithm designed for large autoregressive sequence models. Unlike PPO, it does not need a separate value network: the baseline comes from comparing samples within a group. In short:

  1. Generate a group of candidate outputs from the model (policy) for the same prompt.
  2. Score each candidate using a reward model.
  3. Normalize the rewards within the group to get relative advantages, then adjust the model parameters to increase the probability of high-reward sequences and decrease the probability of low-reward sequences (a minimal sketch follows).
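
To make the group-relative part concrete, here is a minimal, simplified sketch of the advantage computation and clipped policy loss for a single prompt. The policy.sample / policy.log_prob methods and reward_fn are illustrative placeholders, not the API of the actual training script.

import torch

def grpo_loss(policy, old_policy, prompt, reward_fn, group_size=8, clip_eps=0.2):
    # 1. Sample a group of candidate token sequences from the frozen behaviour policy.
    candidates = [old_policy.sample(prompt) for _ in range(group_size)]

    # 2. Score each candidate with the (possibly non-differentiable) reward model.
    rewards = torch.tensor([reward_fn(prompt, c) for c in candidates], dtype=torch.float32)

    # 3. Group-relative advantages: normalize rewards within the group (no value network needed).
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # 4. PPO-style clipped objective on the likelihood ratio between current and old policy.
    losses = []
    for cand, adv in zip(candidates, advantages):
        ratio = torch.exp(policy.log_prob(prompt, cand) - old_policy.log_prob(prompt, cand).detach())
        losses.append(-torch.min(ratio * adv, torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv))
    return torch.stack(losses).mean()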

3. Architecture Recap: Llasa + Xcodec2

We still rely on the Llasa pipeline:

| Component | Role |
|-----------|------|
| LLaSA Transformer | Autoregressive model generating speech tokens |
| Xcodec2 | Tokenizer converting waveform into discrete speech tokens |

Because the speech is already represented as discrete tokens, the RL loop is efficient and avoids frame-level DSP complications.
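
To see what these discrete tokens look like in practice, here is a rough encode/decode round trip with XCodec2. The class and method names (XCodec2Model, encode_code, decode_code) follow my reading of the xcodec2 package and its model card, so treat them as assumptions to verify locally.

import torch
import soundfile as sf
from xcodec2.modeling_xcodec2 import XCodec2Model  # import path assumed from the model card

codec = XCodec2Model.from_pretrained("HKUSTAudio/xcodec2").eval()

# Load a mono 16 kHz waveform and add a batch dimension.
wav, sr = sf.read("sample.wav")
wav = torch.from_numpy(wav).float().unsqueeze(0)

with torch.no_grad():
    codes = codec.encode_code(input_waveform=wav)  # discrete speech token ids
    recon = codec.decode_code(codes)               # waveform reconstructed from those ids

print("speech token ids:", codes.shape)
sf.write("reconstructed.wav", recon[0, 0].cpu().numpy(), sr)

The RL loop only ever touches the codes tensor; the waveform is needed again only when we want to listen to (or ASR-score) the result.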


4. The GRPO Training Pipeline

The GRPO fine-tuning script is in the repository:

Llasa-GRPO/
│
├─ create_dataset.py         # Script to prepare datasets (tokenize audio, build prompts)
├─ train.py                  # Main GRPO training loop using the policy model
├─ inference.py              # Script for inference: generate speech from text using trained model
├─ reward_whisper.py         # Reward computation module using ASR+WER via Whisper
├─ requirements.txt          # Dependency list
└─ README.md                 # Documentation & usage instructions

4.1 Reward Model: Measuring Speech Quality

Our current experiments use a composite reward that combines a word error rate (WER) term and a negative log-likelihood (NLL) term through a weighted harmonic mean:

$$R = \frac{\lambda_w + \lambda_n}{\frac{\lambda_w}{R_{WER}} + \frac{\lambda_n}{R_{NLL}}}$$

where:

  • $R_{WER} = 1 - \tanh(\alpha_w \cdot WER)$

  • $R_{NLL} = \exp\!\left(-\frac{NLL}{\alpha_n}\right)$

  • $\alpha_w$ and $\alpha_n$ control the sensitivity of the reward function to WER and NLL

  • $\lambda_w$ and $\lambda_n$ denote the weights assigned to the WER and NLL rewards
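
For intuition, here is a small sketch of this composite reward in plain Python, using a Whisper ASR pipeline from transformers for the transcript and jiwer for the WER. The α and λ values are made-up defaults and the NLL is taken as a given input; reward_whisper.py in the repository is the authoritative implementation.

import math
import torch
from jiwer import wer as word_error_rate
from transformers import pipeline

def composite_reward(wer, nll, alpha_w=2.0, alpha_n=5.0, lambda_w=1.0, lambda_n=1.0):
    # Weighted harmonic mean of the WER-based and NLL-based rewards, as in the formula above.
    r_wer = 1.0 - math.tanh(alpha_w * wer)  # 1 at WER = 0, decays toward 0
    r_nll = math.exp(-nll / alpha_n)        # 1 at NLL = 0, decays toward 0
    return (lambda_w + lambda_n) / (lambda_w / r_wer + lambda_n / r_nll)

# ASR model used to judge the generated audio (Whisper large-v3, as in Section 5).
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3",
               torch_dtype=torch.float16, device_map="auto")

def reward_for_sample(reference_text, audio_array, sampling_rate, nll):
    # Transcribe the generated waveform, then score it against the reference text.
    hypothesis = asr({"array": audio_array, "sampling_rate": sampling_rate})["text"]
    return composite_reward(word_error_rate(reference_text.lower(), hypothesis.lower()), nll)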

4.2 Dataset Preparation and Audio Tokenization

Llasa operates on discrete speech tokens, so the audio must be tokenized with XCodec2 before training. The script create_dataset.py handles this end-to-end:

python create_dataset.py \
  --dataset your_dataset_name_or_path \
  --output_dir tokenized_dataset \
  --codec_id HKUSTAudio/xcodec2 \
  --sampling_rate 16000

This script performs the following steps:

| Step | Description |
|------|-------------|
| 1. Audio Loading | Reads raw waveform audio files from the dataset |
| 2. Tokenization | Converts audio into XCodec2 speech tokens |
| 3. Text Alignment | Builds paired text → speech token training sequences |
| 4. Formatting | Saves the processed dataset in a format ready for training or uploading to Hugging Face |

If you already have a dataset that includes XCodec2 speech tokens, you can skip this step and pass it directly to train.py via --dataset.
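
If you do build the dataset yourself, the job is roughly the mapping below. encode_audio_to_codes is a hypothetical helper (e.g. a thin wrapper around the XCodec2 encoder), and the column names are assumptions rather than the exact schema that create_dataset.py produces.

from datasets import Audio, load_dataset

def encode_audio_to_codes(waveform, sampling_rate):
    # Hypothetical wrapper around the XCodec2 encoder; returns a list of speech token ids.
    raise NotImplementedError

def to_training_example(example):
    audio = example["audio"]
    example["speech_tokens"] = encode_audio_to_codes(audio["array"], audio["sampling_rate"])
    example["text"] = example["text"].strip()
    return example

ds = load_dataset("your_dataset_name_or_path", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16000))  # match --sampling_rate above
ds = ds.map(to_training_example, remove_columns=["audio"])
ds.save_to_disk("tokenized_dataset")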

4.3 Running Training

accelerate launch train.py \
  --model_name_or_path HKUSTAudio/Llasa-1B-Multilingual \
  --dataset your_dataset.json \
  --reward_config reward_models/prosody_reward.json \
  --output_dir llasa-grpo-exp1

Hardware:

  • 1×A100 (40GB) for Llasa-1B
  • 2–4 GPUs recommended for stable RL training

5. Running Results

Below is the reward curve recorded during GRPO training. We trained Llasa-1B using Whisper-Large v3 to compute the WER-based and NLL-based reward signals during optimization.

[W&B chart: reward curve during GRPO training, recorded 05/11/2025]

Sample Before & After

| Model | Sample Link |
|-------|-------------|
| Llasa baseline | (audio sample) |
| Llasa + GRPO | (audio sample) |

6. Results: What Changes After GRPO Training?

✅ Improvements observed:

  • The proposed GRPO approach substantially improves both the semantic consistency (i.e., alignment of generated speech with text) and the naturalness of synthesized speech.
  • The combined reward function (WER + NLL) offers better training stability and higher-quality outputs than either component alone.
  • In zero-shot TTS evaluation across multiple languages, fine-tuning with GRPO reduced character/word error rates (CER/WER) and improved mean opinion score (MOS) for naturalness.

⚠️ Things to monitor:

  • While intelligibility and semantic consistency improved, speaker similarity did not uniformly increase; some cases showed comparable or only marginal gains in speaker style retention.
  • Even with the improved reward scheme, optimizing only with ASR-based metrics may not fully capture all perceptual aspects of speech (e.g., emotion, expressiveness, prosody nuance), as the authors of the GRPO-for-TTS paper also note.

7. What’s Next?

Upcoming work includes:

| Direction | Goal |
|-----------|------|
| Learned prosody reward model | Replace handcrafted metrics with neural reward |
| Human feedback RL (RLAIF) | Evaluate perceived emotional quality directly |
| Speaker adaptation GRPO | Optimize stylistic variation per speaker identity |

The ultimate target: controllable, emotionally expressive multilingual speech.


Closing

This experiment shows that speech synthesis is entering the same RL-driven era as text LLMs. By shaping how the model speaks, rather than just what it says, we open the door to voice assistants, audiobook narrators, and virtual characters that actually sound alive.

Repository: https://github.com/Deep-unlearning/Llasa-GRPO

If you test or build on this, I would be glad to hear your results!
