MOSS-TTS-Realtime ONNX Inference
Pure ONNX Runtime inference pipeline for MOSS-TTS-Realtime, enabling streaming text-to-speech without any PyTorch or Hugging Face Transformers dependency at runtime.
Overview
This repository provides:
- `inferencer_onnx.py` – Core streaming TTS engine that orchestrates four ONNX models (backbone LLM, local transformer, codec encoder, codec decoder) using only NumPy and ONNX Runtime.
- `moss_text_tokenizer.py` – Lightweight Qwen3-compatible tokenizer wrapping the `tokenizers` library, with no `transformers` dependency.
- `test_basic_streaming-onnx.py` – End-to-end test script that simulates LLM streaming text and produces a WAV file.
Architecture
```
Reference Audio ───► Codec Encoder ───► RVQ Audio Codes (voice clone context)
                                                │
                                                ▼
Text Deltas ───► Backbone LLM (Qwen3-1.7B) ───► Hidden States
                                                │
                                                ▼
                 Local Transformer ───────────► 16-codebook Audio Tokens
                                                │
                                                ▼
                 Codec Decoder ───────────────► 24 kHz Waveform
```
| Component | ONNX Model | Description |
|---|---|---|
| Backbone LLM | `backbone_llm.onnx` | Qwen3-based causal LM mapping interleaved text+audio tokens to hidden states. Maintains a growing KV-cache across the entire generation. |
| Local Transformer | `backbone_local.onnx` | Depth-wise decoder generating 16 RVQ codebook entries per frame from backbone hidden states. Creates and discards a fresh KV-cache per frame. |
| Codec Encoder | `codec_encoder.onnx` | Encodes the reference speaker waveform into RVQ codes for voice cloning. Runs once per session. |
| Codec Decoder | `codec_decoder.onnx` | Decodes RVQ audio codes back to a 24 kHz waveform. Maintains four hierarchical KV-caches for streaming decode. |
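The per-frame data flow in the table above can be sketched with stand-in functions. This is purely illustrative: the real orchestration lives in `inferencer_onnx.py`, and the hidden size and samples-per-frame used here are placeholder values, not the models' actual dimensions.

```python
import numpy as np

NUM_CODEBOOKS = 16  # RVQ codebooks per frame (see the table above)
HIDDEN = 2048       # illustrative hidden size, not the real config value

# Stand-ins for the ONNX sessions; shapes are illustrative only.
def backbone_llm(token_ids: np.ndarray) -> np.ndarray:
    """Map one step of interleaved text+audio tokens to a hidden state."""
    return np.zeros((1, 1, HIDDEN), dtype=np.float32)

def local_transformer(hidden: np.ndarray) -> np.ndarray:
    """Emit the 16 RVQ codebook entries for one audio frame."""
    return np.zeros((1, NUM_CODEBOOKS), dtype=np.int64)

def codec_decoder(codes: np.ndarray) -> np.ndarray:
    """Decode one frame of RVQ codes into a chunk of 24 kHz samples."""
    return np.zeros((1, 1920), dtype=np.float32)  # chunk length illustrative

# One generation step: backbone -> local transformer -> codec decoder.
def generate_frame(token_ids: np.ndarray) -> np.ndarray:
    hidden = backbone_llm(token_ids)
    codes = local_transformer(hidden)
    return codec_decoder(codes)

audio = generate_frame(np.array([[1]], dtype=np.int64))
```

The codec encoder sits outside this loop: it runs once per session to build the voice-clone context before the first frame is generated.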
Requirements
numpy
onnxruntime
soundfile
librosa
tokenizers
Install with:
pip install numpy onnxruntime soundfile librosa tokenizers
Directory Structure
```
.
├── inferencer_onnx.py           # Core ONNX inference engine
├── moss_text_tokenizer.py       # Lightweight Qwen3 tokenizer
├── test_basic_streaming-onnx.py # End-to-end test script
├── README.md
├── onnx_models/                 # FP32
│   ├── backbone_f32/
│   │   └── backbone_f32.onnx
│   ├── local_transformer_f32/
│   │   └── local_transformer_f32.onnx
│   ├── codec_decoder/
│   │   └── codec_decoder.onnx
│   └── codec_encoder/
│       └── codec_encoder.onnx
├── onnx_models_quantized/       # INT8
│   └── codec_decoder_int8/
│       └── codec_decoder_int8.onnx
├── configs/
│   ├── config_backbone.json
│   └── config_codec.json
├── tokenizers/
│   ├── tokenizer.json
│   └── tokenizer_config.json
├── audio_ref/
│   └── <reference_speaker>.[wav|mp3|flac]
└── audio_synth/
    └── <synthesized_example>.wav
```
Usage
Basic Streaming TTS
Notes 1
- With float32, all loaded models consume about 13 GB. On 16 GB of RAM, generation will OOM after roughly 120 steps.
- With ≤ 16 GB of RAM, use the quantized (INT8) codec decoder to avoid OOM. The codec encoder can also be quantized, but this degrades output quality.
- With a quantized (INT8) backbone_llm and backbone_local transformer, quality becomes unacceptable: the output is mostly gibberish or hallucinated.
- BF16, the format in which the original MOSS-TTS weights are saved, is not yet well supported on most CPUs. For GPU inference, you can convert the FP32 model.
- We also observed that FP32 inference (torch/ONNX) on backbone_llm and backbone_local is somewhat less stable than BF16 (torch), likely because the model was trained in bfloat16 and FP32 inference strays outside the numerical range it was tuned for.
- The best option is therefore probably an FP16 ONNX conversion on a GPU or an FP16-capable CPU. We tried m8a.xlarge and m8a.2xlarge instances; their CPUs do not support FP16.
Notes 2
- The KV-caching mechanism is modified so each model takes a past_kv tensor as input, initialized with zero length on the time dimension. This removes the need to export separate prefill and step ONNX graphs: a single ONNX model handles both the initial (prefill) call and every subsequent step. This applies to backbone_llm (Qwen3Model), backbone_local_transformer, and codec_decoder; codec_encoder always receives the full sequence.
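The single-graph prefill/step trick described above can be illustrated in NumPy: because the cache starts with length 0 on the time axis, the first (prefill) call and all later incremental calls go through exactly the same concatenation logic. The batch, head, and head-dimension sizes below are illustrative, not the models' real values.

```python
import numpy as np

BATCH, HEADS, HEAD_DIM = 1, 8, 64  # illustrative dimensions

# An "empty" cache: zero length on the time axis (axis 2).
past_k = np.zeros((BATCH, HEADS, 0, HEAD_DIM), dtype=np.float32)

def update_cache(past_k: np.ndarray, new_k: np.ndarray) -> np.ndarray:
    """Append this call's keys to the cache. Works identically for prefill
    (empty past) and for incremental steps, so one graph covers both."""
    return np.concatenate([past_k, new_k], axis=2)

# Prefill: 10 prompt positions against the empty cache.
past_k = update_cache(past_k, np.random.randn(BATCH, HEADS, 10, HEAD_DIM).astype(np.float32))
# Step: one new position appended to the existing cache.
past_k = update_cache(past_k, np.random.randn(BATCH, HEADS, 1, HEAD_DIM).astype(np.float32))
print(past_k.shape)  # (1, 8, 11, 64)
```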
Notes 3
- The default text is Russian; you can change it via the CLI arguments. The speaker prompt has also been translated into Russian and can be changed in inferencer_onnx.py.
- The prompting and the default decoding hyperparameters (temperature, top_p, top_k, repetition penalty) were tuned for Russian; you will likely want to adjust them for other languages.
- The default MOSS-TTS prompt is in English. We found that slightly modifying it, or even translating it into the target language, produces a more consistent accent/nativeness; this is what we do for Russian inside MOSSTTSRealtimeProcessor.
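For reference, the default decoding knobs combine roughly as in this minimal NumPy sketch of repetition-penalized temperature/top-k/top-p sampling. This is a generic textbook implementation, not the engine's actual code (which lives in inferencer_onnx.py); `recent_ids` stands for the last `repetition_window` (default 50) generated tokens.

```python
import numpy as np

def sample(logits, recent_ids, temperature=0.725, top_p=0.6, top_k=34,
           repetition_penalty=1.9, rng=None):
    """Sample one token id from raw logits using the README's default knobs."""
    rng = rng or np.random.default_rng()
    logits = logits.astype(np.float64).copy()
    # Repetition penalty (CTRL-style) over the recent window of token ids.
    for i in set(recent_ids):
        logits[i] = logits[i] / repetition_penalty if logits[i] > 0 else logits[i] * repetition_penalty
    logits /= temperature
    # Top-k: mask everything below the k-th largest logit.
    kth = np.sort(logits)[-top_k]
    logits[logits < kth] = -np.inf
    # Softmax, then top-p (nucleus) filtering over the sorted probabilities.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(-probs)
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1  # keep >= 1
    keep = order[:cutoff]
    p = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=p))

token = sample(np.random.randn(1024), recent_ids=[3, 7])
```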
Example
- With quantized (INT8) codec decoder (requires at least 13 GB RAM):

```shell
python test_basic_streaming-onnx.py \
    --tokenizer_vocab_path tokenizers/tokenizer.json \
    --tokenizer_config_path tokenizers/tokenizer_config.json \
    --backbone_llm_path onnx_models/backbone_f32/backbone_f32.onnx \
    --backbone_local_path onnx_models/local_transformer_f32/local_transformer_f32.onnx \
    --codec_decoder_path onnx_models_quantized/codec_decoder_int8/codec_decoder_int8.onnx \
    --codec_encoder_path onnx_models/codec_encoder/codec_encoder.onnx \
    --backbone_config_path configs/config_backbone.json \
    --codec_config_path configs/config_codec.json \
    --prompt_wav audio_ref/male_stewie.mp3 \
    --out_wav output.wav
```

- With all FP32 (requires > 16 GB RAM):

```shell
python test_basic_streaming-onnx.py \
    --tokenizer_vocab_path tokenizers/tokenizer.json \
    --tokenizer_config_path tokenizers/tokenizer_config.json \
    --backbone_llm_path onnx_models/backbone_f32/backbone_f32.onnx \
    --backbone_local_path onnx_models/local_transformer_f32/local_transformer_f32.onnx \
    --codec_decoder_path onnx_models/codec_decoder/codec_decoder.onnx \
    --codec_encoder_path onnx_models/codec_encoder/codec_encoder.onnx \
    --backbone_config_path configs/config_backbone.json \
    --codec_config_path configs/config_codec.json \
    --prompt_wav audio_ref/male_stewie.mp3 \
    --out_wav output.wav
```
Programmatic Usage
```python
import json

import onnxruntime as ort

from inferencer_onnx import MossTTSRealtimeInferenceONNX
from moss_text_tokenizer import MOSSTextTokenizer

# Load tokenizer and ONNX sessions
tokenizer = MOSSTextTokenizer("tokenizers/tokenizer.json",
                              "tokenizers/tokenizer_config.json")
backbone_llm = ort.InferenceSession("onnx_models/backbone_f32/backbone_f32.onnx",
                                    providers=["CPUExecutionProvider"])
backbone_local = ort.InferenceSession("onnx_models/local_transformer_f32/local_transformer_f32.onnx",
                                      providers=["CPUExecutionProvider"])
codec_decoder = ort.InferenceSession("onnx_models/codec_decoder/codec_decoder.onnx",
                                     providers=["CPUExecutionProvider"])
codec_encoder = ort.InferenceSession("onnx_models/codec_encoder/codec_encoder.onnx",
                                     providers=["CPUExecutionProvider"])

with open("configs/config_backbone.json") as f:
    backbone_config = json.load(f)
with open("configs/config_codec.json") as f:
    codec_config = json.load(f)

# Create inferencer
inferencer = MossTTSRealtimeInferenceONNX(
    tokenizer, backbone_llm, backbone_local,
    codec_decoder, codec_encoder,
    backbone_config, codec_config,
)

# Encode reference speaker for voice cloning
prompt_tokens = inferencer._encode_reference_audio("audio_ref/speaker.wav")
input_ids = inferencer.processor.make_ensemble(prompt_tokens.squeeze(1))
inferencer.reset_turn(input_ids=input_ids, include_system_prompt=False,
                      reset_cache=True)

# Stream text and collect audio
for delta in your_llm_stream():
    audio_frames = inferencer.push_text(delta)
    for frame in audio_frames:
        # push_tokens + audio_chunks for waveform decoding
        ...
```
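Once frames have been collected, they can be concatenated and written to disk. The test script uses soundfile; the dependency-free sketch below uses only the standard-library `wave` module, under the assumption (not stated in the engine's API) that each frame is a float32 NumPy array of 24 kHz samples in [-1, 1].

```python
import wave

import numpy as np

def write_wav(path: str, frames: list, sample_rate: int = 24000) -> None:
    """Concatenate float32 audio frames and write them as 16-bit PCM WAV."""
    audio = np.concatenate(frames)
    pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)           # mono
        wf.setsampwidth(2)           # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm.tobytes())

# Three silent placeholder frames stand in for real decoder output.
write_wav("output.wav", [np.zeros(1920, dtype=np.float32)] * 3)
```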
Command-Line Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
| `--tokenizer_vocab_path` | str | required | Path to tokenizer.json |
| `--tokenizer_config_path` | str | required | Path to tokenizer_config.json |
| `--backbone_llm_path` | str | required | Path to backbone LLM ONNX model |
| `--backbone_local_path` | str | required | Path to local transformer ONNX model |
| `--codec_decoder_path` | str | required | Path to codec decoder ONNX model |
| `--codec_encoder_path` | str | required | Path to codec encoder ONNX model |
| `--backbone_config_path` | str | required | Path to config_backbone.json |
| `--codec_config_path` | str | required | Path to config_codec.json |
| `--prompt_wav` | str | required | Reference speaker audio for voice cloning |
| `--out_wav` | str | out_streaming.wav | Output WAV file path |
| `--sample_rate` | int | 24000 | Output sample rate (Hz) |
| `--temperature` | float | 0.725 | Sampling temperature |
| `--top_p` | float | 0.6 | Nucleus sampling threshold |
| `--top_k` | int | 34 | Top-k sampling cutoff |
| `--repetition_penalty` | float | 1.9 | Repetition penalty coefficient |
| `--repetition_window` | int | 50 | Window for repetition penalty |
| `--max_length` | int | 5000 | Maximum generation steps |
| `--delta_chunk_chars` | int | 1 | Characters per simulated LLM delta |
| `--delta_delay_s` | float | 0.0 | Delay between simulated deltas (seconds) |
| `--assistant_text` | str | (Russian text) | Text to synthesize |
Acknowledgments
This work builds upon the MOSS-TTS-Realtime model by OpenMOSS Team and the MOSS-Audio-Tokenizer codec.
License
Copyright 2026 Patrick Lumbantobing, Vertox-AI
Licensed under the Apache License, Version 2.0. See LICENSE for details.