MOSS-TTS-Realtime ONNX Inference

Pure ONNX Runtime inference pipeline for MOSS-TTS-Realtime, enabling streaming text-to-speech without any PyTorch or Hugging Face Transformers dependency at runtime.

Overview

This repository provides:

  • inferencer_onnx.py β€” Core streaming TTS engine that orchestrates four ONNX models (backbone LLM, local transformer, codec encoder, codec decoder) using only NumPy and ONNX Runtime.
  • moss_text_tokenizer.py β€” Lightweight Qwen3-compatible tokenizer wrapping the tokenizers library, with no transformers dependency.
  • test_basic_streaming-onnx.py β€” End-to-end test script that simulates LLM streaming text and produces a WAV file.

Architecture

Reference Audio ──► Codec Encoder ──► RVQ Audio Codes (voice clone context)
                                           │
                                           ▼
Text Deltas ──► Backbone LLM (Qwen3-1.7B) ──► Hidden States
                                                    │
                                                    ▼
                                            Local Transformer ──► 16-codebook Audio Tokens
                                                                        │
                                                                        ▼
                                                                Codec Decoder ──► 24 kHz Waveform
| Component | ONNX Model | Description |
|---|---|---|
| Backbone LLM | backbone_llm.onnx | Qwen3-based causal LM mapping interleaved text+audio tokens to hidden states. Maintains a growing KV-cache across the entire generation. |
| Local Transformer | backbone_local.onnx | Depth-wise decoder generating 16 RVQ codebook entries per frame from backbone hidden states. Creates and discards a fresh KV-cache per frame. |
| Codec Encoder | codec_encoder.onnx | Encodes the reference speaker waveform into RVQ codes for voice cloning. Run once per session. |
| Codec Decoder | codec_decoder.onnx | Decodes RVQ audio codes back to a 24 kHz waveform. Maintains four hierarchical KV-caches for streaming decode. |

Requirements

numpy
onnxruntime
soundfile
librosa
tokenizers

Install with:

pip install numpy onnxruntime soundfile librosa tokenizers

Directory Structure

.
├── inferencer_onnx.py              # Core ONNX inference engine
├── moss_text_tokenizer.py          # Lightweight Qwen3 tokenizer
├── test_basic_streaming-onnx.py    # End-to-end test script
├── README.md
├── onnx_models/                    # FP32 models
│   ├── backbone_f32/
│   │   └── backbone_f32.onnx
│   ├── local_transformer_f32/
│   │   └── local_transformer_f32.onnx
│   ├── codec_decoder/
│   │   └── codec_decoder.onnx
│   └── codec_encoder/
│       └── codec_encoder.onnx
├── onnx_models_quantized/          # INT8 models
│   └── codec_decoder_int8/
│       └── codec_decoder_int8.onnx
├── configs/
│   ├── config_backbone.json
│   └── config_codec.json
├── tokenizers/
│   ├── tokenizer.json
│   └── tokenizer_config.json
├── audio_ref/
│   └── <reference_speaker>.[wav|mp3|flac]
└── audio_synth/
    └── <synthesized_example>.wav

Usage

Basic Streaming TTS

Notes 1

  • With float32, all models loaded will consume about 13GB. It will OOM after about 120 steps on 16GB RAM.
  • With <= 16GB RAM, you can use quantized (INT8) codec decoder to avoid OOM. Quantized codec encoder can also be further used but degrades the performance.
  • With quantized (INT8) backbone_llm and backbone_local_transformer, the performance will be unacceptable and most of the times gibberish/hallucinates.
  • BF16, as the original MOSS-TTS model is saved, is not yet fully supported on most CPUs. If you want to use GPU, you can convert the fp32 model.
  • We also noted that the performance when using FP32 (torch/ONNX) on backbone_llm and backbone_local is a bit unstable compared to bf16 (torch). Probably due to the training with bfloat16 and excessed in numerical range with fp32 inference.
  • So, perhaps the better option is to use ONNX converted to fp16 with GPU/supported CPU. We tried with m8a.xlarge and m8a.2xlarge instances, they do not support CPU with fp16.

Notes 2

  • The KV caching mechanism is modified to use input past_kv tensor/array and initialized with empty on the time dimension so no need to export two ONNXs for prefill and step. In this case, one ONNX can handle both initializing and continuing. This mechanism is all for the backbone_llm (Qwen3Model), backbone_local_transformer, and codec_decoder. The codec_encoder always receives full sequence.

Notes 3

  • Text by default in Russian, you can modify in the args. The prompt for the speaker is also modified in Russian, you can change in the inferencer_onnx.py.
  • This prompting and default decoding hyperparameters (temp, top_p, top_k, repetition) has been optimized for Russian, and you can probably change for your language.
  • The default prompt from MOSS-TTS is given in English, and we investigated you can slightly modify and even change to your targeted language to produce consistent accent/nativeness as we are using for the Russian within the MOSSTTSRealtimeProcessor.

Example

  • With quantized (INT8) codec decoder (requires at least 13GB RAM)
python test_basic_streaming-onnx.py --tokenizer_vocab_path tokenizers/tokenizer.json --tokenizer_config_path tokenizers/tokenizer_config.json --backbone_llm_path onnx_models/backbone_f32/backbone_f32.onnx --backbone_local_path onnx_models/local_transformer_f32/local_transformer_f32.onnx --codec_decoder_path onnx_models_quantized/codec_decoder_int8/codec_decoder_int8.onnx --codec_encoder_path onnx_models/codec_encoder/codec_encoder.onnx --backbone_config_path configs/config_backbone.json --codec_config_path configs/config_codec.json --prompt_wav audio_ref/male_stewie.mp3 --out_wav output.wav
  • With all FP32 (requires > 16GB RAM)
python test_basic_streaming-onnx.py --tokenizer_vocab_path tokenizers/tokenizer.json --tokenizer_config_path tokenizers/tokenizer_config.json --backbone_llm_path onnx_models/backbone_f32/backbone_f32.onnx --backbone_local_path onnx_models/local_transformer_f32/local_transformer_f32.onnx --codec_decoder_path onnx_models/codec_decoder/codec_decoder.onnx --codec_encoder_path onnx_models/codec_encoder/codec_encoder.onnx --backbone_config_path configs/config_backbone.json --codec_config_path configs/config_codec.json --prompt_wav audio_ref/male_stewie.mp3 --out_wav output.wav

Programmatic Usage

import json
import onnxruntime as ort
from inferencer_onnx import MossTTSRealtimeInferenceONNX
from moss_text_tokenizer import MOSSTextTokenizer

# Load tokenizer and ONNX sessions
tokenizer = MOSSTextTokenizer("tokenizers/tokenizer.json",
                               "tokenizers/tokenizer_config.json")
backbone_llm = ort.InferenceSession("onnx_models/backbone_f32/backbone_f32.onnx",
                                    providers=["CPUExecutionProvider"])
backbone_local = ort.InferenceSession("onnx_models/local_transformer_f32/local_transformer_f32.onnx",
                                      providers=["CPUExecutionProvider"])
codec_decoder = ort.InferenceSession("onnx_models/codec_decoder/codec_decoder.onnx",
                                     providers=["CPUExecutionProvider"])
codec_encoder = ort.InferenceSession("onnx_models/codec_encoder/codec_encoder.onnx",
                                     providers=["CPUExecutionProvider"])

with open("configs/config_backbone.json") as f:
    backbone_config = json.load(f)
with open("configs/config_codec.json") as f:
    codec_config = json.load(f)

# Create inferencer
inferencer = MossTTSRealtimeInferenceONNX(
    tokenizer, backbone_llm, backbone_local,
    codec_decoder, codec_encoder,
    backbone_config, codec_config,
)

# Encode reference speaker for voice cloning
prompt_tokens = inferencer._encode_reference_audio("audio_ref/speaker.wav")
input_ids = inferencer.processor.make_ensemble(prompt_tokens.squeeze(1))
inferencer.reset_turn(input_ids=input_ids, include_system_prompt=False,
                      reset_cache=True)

# Stream text and collect audio
for delta in your_llm_stream():
    audio_frames = inferencer.push_text(delta)
    for frame in audio_frames:
        # push_tokens + audio_chunks for waveform decoding
        ...
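Once generation finishes, the collected frames can be written to disk. A minimal sketch using the standard-library wave module (soundfile from the requirements works just as well), assuming each frame is a mono float32 NumPy array in [-1, 1] at 24 kHz:

```python
import wave
import numpy as np

def write_wav(path, frames, sample_rate=24000):
    # Concatenate streamed frames and convert float32 [-1, 1] to 16-bit PCM.
    audio = np.concatenate(frames)
    pcm16 = (np.clip(audio, -1.0, 1.0) * 32767.0).astype(np.int16)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)            # mono
        wf.setsampwidth(2)            # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm16.tobytes())

# e.g.: write_wav("output.wav", collected_frames)
```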

Command-Line Arguments

| Argument | Type | Default | Description |
|---|---|---|---|
| --tokenizer_vocab_path | str | required | Path to tokenizer.json |
| --tokenizer_config_path | str | required | Path to tokenizer_config.json |
| --backbone_llm_path | str | required | Path to backbone LLM ONNX model |
| --backbone_local_path | str | required | Path to local transformer ONNX model |
| --codec_decoder_path | str | required | Path to codec decoder ONNX model |
| --codec_encoder_path | str | required | Path to codec encoder ONNX model |
| --backbone_config_path | str | required | Path to config_backbone.json |
| --codec_config_path | str | required | Path to config_codec.json |
| --prompt_wav | str | required | Reference speaker audio for voice cloning |
| --out_wav | str | out_streaming.wav | Output WAV file path |
| --sample_rate | int | 24000 | Output sample rate (Hz) |
| --temperature | float | 0.725 | Sampling temperature |
| --top_p | float | 0.6 | Nucleus sampling threshold |
| --top_k | int | 34 | Top-k sampling cutoff |
| --repetition_penalty | float | 1.9 | Repetition penalty coefficient |
| --repetition_window | int | 50 | Window for repetition penalty |
| --max_length | int | 5000 | Maximum generation steps |
| --delta_chunk_chars | int | 1 | Characters per simulated LLM delta |
| --delta_delay_s | float | 0.0 | Delay between simulated deltas (seconds) |
| --assistant_text | str | (Russian text) | Text to synthesize |

Acknowledgments

This work builds upon the MOSS-TTS-Realtime model by the OpenMOSS team and the MOSS-Audio-Tokenizer codec.

License

Copyright 2026 Patrick Lumbantobing, Vertox-AI

Licensed under the Apache License, Version 2.0. See LICENSE for details.
