---
library_name: transformers
---

# Malaysian-TTS-4.0B-v0.1

Continued pretraining of [Qwen/Qwen3-4B-Base](https://huggingface.co/Qwen/Qwen3-4B-Base) on [mesolitica/Malaysian-TTS-v2](https://huggingface.co/datasets/mesolitica/Malaysian-TTS-v2):

1. Uses [DistilCodec](https://github.com/IDEA-Emdoor-Lab/DistilCodec) as the speech detokenizer; output is at a 24 kHz sample rate.
2. Optional controllable pitch and speed for each word.
3. Supports context switching between Malay and English.
4. Supports streamable text segments.
5. Supports the `husein` and `idayu` speakers only.

## How we trained

1. The dataset is purely synthetic, generated using [mesolitica/Malaysian-Podcast-Dia-1.6B](https://huggingface.co/mesolitica/Malaysian-Podcast-Dia-1.6B).
2. Multipacking with proper document masking at 4096 context length.
3. FP32-BF16 mixed precision training.
4. Full parameter finetuning.
5. WandB at https://wandb.ai/huseinzol05/Qwen-Qwen3-4B-Base-4k-TTS-distilcodec

## How to use

1. First, install DistilCodec,

```bash
pip3 install git+https://github.com/mesolitica/DistilCodec
```

2. Load the models,

```python
# wget https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0/resolve/main/model_config.json
# wget https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0/resolve/main/g_00204000
from distilcodec import DistilCodec, demo_for_generate_audio_codes
from transformers import AutoTokenizer, AutoModelForCausalLM

codec_model_config_path = 'model_config.json'
codec_ckpt_path = 'g_00204000'

codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    use_generator=True,
    is_debug=False,
).eval()

tokenizer = AutoTokenizer.from_pretrained('mesolitica/Malaysian-TTS-4B-v0.1')
model = AutoModelForCausalLM.from_pretrained(
    'mesolitica/Malaysian-TTS-4B-v0.1',
    torch_dtype='auto',
).cuda()
```

### Non-streaming

```python
import re

import soundfile as sf

string = 'The first anti-hoax legislation in the world, Akta Anti Berita Tidak Benar two thousand and eighteen. Saya nak makan nasi ayam.'
# Change 'idayu' to 'husein' to switch speaker.
prompt = f'<|im_start|>idayu: {string}<|speech_start|>'

generate_kwargs = dict(
    **tokenizer(prompt, return_tensors='pt', add_special_tokens=False).to('cuda'),
    max_new_tokens=1024,
    temperature=0.5,
    do_sample=True,
    repetition_penalty=1.0,
)
generation_output = model.generate(**generate_kwargs)

# Everything after <|speech_start|> is a sequence of speech code tokens.
speech_token = tokenizer.decode(generation_output[0]).split('<|speech_start|>')[1].replace('<|endoftext|>', '')
numbers = re.findall(r'speech_(\d+)', speech_token)
codes = list(map(int, numbers))

y_gen = codec.decode_from_codes(codes, minus_token_offset=False)
sf.write('output.mp3', y_gen[0, 0].cpu().numpy(), 24000)
```

Output:

1. [output-idayu.mp3](output-idayu.mp3)
2. [output-husein.mp3](output-husein.mp3)
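### Streaming (sketch)

The card lists streamable text segments as a feature, but only the non-streaming path is shown above. Below is a minimal streaming sketch, not the card's official recipe: it reuses `codec`, `tokenizer`, and `model` from the loading step, and the chunked use of `decode_from_codes` plus the `CHUNK` size of 50 codes are assumptions of this sketch. Decoding fixed windows independently may leave audible seams at chunk boundaries.

```python
# Minimal streaming sketch (assumes `codec`, `tokenizer`, `model` from above).
# NOTE: chunked decoding via decode_from_codes is an assumption of this sketch;
# the model's intended streaming interface may differ.
import re
from threading import Thread

import numpy as np
import soundfile as sf
from transformers import TextIteratorStreamer

string = 'Saya nak makan nasi ayam.'
prompt = f'<|im_start|>idayu: {string}<|speech_start|>'

# skip_prompt=True: yield only newly generated text (the speech code tokens).
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=False)
generate_kwargs = dict(
    **tokenizer(prompt, return_tensors='pt', add_special_tokens=False).to('cuda'),
    max_new_tokens=1024,
    temperature=0.5,
    do_sample=True,
    streamer=streamer,
)

# Run generation in a background thread and consume text as it arrives.
thread = Thread(target=model.generate, kwargs=generate_kwargs)
thread.start()

CHUNK = 50      # speech codes per decode call; an arbitrary latency/quality trade-off
generated = ''  # all streamed text seen so far
decoded = 0     # number of codes already sent to the codec
chunks = []

for text in streamer:
    generated += text
    codes = list(map(int, re.findall(r'speech_(\d+)', generated)))
    # Hold the last match back in case the streamer yielded a partial token.
    while len(codes) - 1 - decoded >= CHUNK:
        y = codec.decode_from_codes(codes[decoded:decoded + CHUNK], minus_token_offset=False)
        chunks.append(y[0, 0].cpu().numpy())
        decoded += CHUNK

thread.join()

# Flush the remaining codes once generation has finished.
codes = list(map(int, re.findall(r'speech_(\d+)', generated.replace('<|endoftext|>', ''))))
if decoded < len(codes):
    y = codec.decode_from_codes(codes[decoded:], minus_token_offset=False)
    chunks.append(y[0, 0].cpu().numpy())

sf.write('output-stream.mp3', np.concatenate(chunks), 24000)
```

In a real application each decoded chunk would be pushed to an audio sink as soon as it is ready rather than concatenated at the end; note also that `TextIteratorStreamer` buffers text up to whitespace boundaries, so how promptly codes become available depends on how the speech tokens decode.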
## Source code

Source code at https://github.com/mesolitica/malaya-speech/tree/master/session/qwen-tts

## Acknowledgement

Special thanks to https://www.sns.com.my and Nvidia for 1x H100!