---
library_name: transformers
tags: []
---

# Malaysian-TTS-0.6B-v1

Continued pretraining of [mesolitica/Malaysian-TTS-0.6B-v0.1](https://huggingface.co/mesolitica/Malaysian-TTS-0.6B-v0.1) on a more consistent dataset:

1. Uses [DistilCodec](https://github.com/IDEA-Emdoor-Lab/DistilCodec) as the speech detokenizer, with output at a 24k sample rate.
2. Supports context switching between Malay and English.
3. Better pronunciation of letters.
4. Better tolerance of repetitive text.

## Speakers

1. [husein](https://huggingface.co/datasets/mesolitica/Malaysian-TTS-v2)
2. [idayu](https://huggingface.co/datasets/mesolitica/Malaysian-TTS-v2)
3. [singaporean](https://huggingface.co/datasets/mesolitica/IMDA-TTS)
4. [DisfluencySpeech](https://huggingface.co/datasets/amaai-lab/DisfluencySpeech)
5. [singlish-speaker2050](https://huggingface.co/datasets/thucdangvan020999/singlish-speaker2050)
6. [singlish-speaker2202](https://huggingface.co/datasets/thucdangvan020999/singlish-speaker2202)
7. [haqkiem](https://www.linkedin.com/in/haqkiem-daim/), a private dataset.

## How we trained

1. Multipacking with proper document masking at 4096 context length.
2. FP32-BF16 mixed precision training.
3. Full parameter finetuning.
4. WandB at https://wandb.ai/huseinzol05/Malaysian-TTS-0.6B-v1

## How to use

1. First install DistilCodec,

```bash
pip3 install git+https://github.com/mesolitica/DistilCodec
```

2. Load the models,

```python
# wget https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0/resolve/main/model_config.json
# wget https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0/resolve/main/g_00204000

from distilcodec import DistilCodec, demo_for_generate_audio_codes
from transformers import AutoTokenizer, AutoModelForCausalLM

codec_model_config_path = 'model_config.json'
codec_ckpt_path = 'g_00204000'

# Speech detokenizer: converts generated speech tokens back into a 24k waveform.
codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    use_generator=True,
    is_debug=False).eval()

tokenizer = AutoTokenizer.from_pretrained('mesolitica/Malaysian-TTS-0.6B-v1')
model = AutoModelForCausalLM.from_pretrained('mesolitica/Malaysian-TTS-0.6B-v1', torch_dtype='auto').cuda()
```

3. Generate,

```python
import re

import soundfile as sf
from tqdm import tqdm

speakers = [
    'husein',
    'idayu',
    'singaporean',
    'DisfluencySpeech',
    'singlish-speaker2050',
    'singlish-speaker2202',
    'haqkiem',
]

string = 'IC saya adalah, sembilan enam, kosong tiga, satu empat, one, one, one, one, A, B, C, D, and yes, what can I help you sir?'

for s in tqdm(speakers):
    # Prompt format: `{speaker}: {text}`, then the model continues with speech tokens.
    left = s + ': ' + string
    prompt = f'<|im_start|>{left}<|speech_start|>'
    generate_kwargs = dict(
        **tokenizer(prompt, return_tensors='pt', add_special_tokens=False).to('cuda'),
        max_new_tokens=1024,
        temperature=0.7,
        do_sample=True,
        repetition_penalty=1.1,
    )
    generation_output = model.generate(**generate_kwargs)
    # Keep only the speech tokens and convert them back into codec codes.
    speech_token = tokenizer.decode(generation_output[0]).split('<|speech_start|>')[-1].replace('<|endoftext|>', '')
    numbers = re.findall(r'speech_(\d+)', speech_token)
    d = list(map(int, numbers))
    y_gen = codec.decode_from_codes(
        d,
        minus_token_offset=False
    )
    sf.write(f'{s}.mp3', y_gen[0, 0].cpu().numpy(), 24000)
```

Output,

1. [husein-0.6b.mp3](husein-0.6b.mp3)
2. [idayu-0.6b.mp3](idayu-0.6b.mp3)
3. [singaporean-0.6b.mp3](singaporean-0.6b.mp3)
4. [DisfluencySpeech-0.6b.mp3](DisfluencySpeech-0.6b.mp3)
5. [singlish-speaker2050-0.6b.mp3](singlish-speaker2050-0.6b.mp3)
6. [singlish-speaker2202-0.6b.mp3](singlish-speaker2202-0.6b.mp3)
7. [haqkiem-0.6b.mp3](haqkiem-0.6b.mp3)

## Limitation

1. This model was trained on normalized text, so text such as `123` must be normalized first into `one two three`, `one hundred twenty three`, `satu dua tiga`, or `seratus dua puluh tiga`. Feel free to use Malaya for normalization; Malaya supports both Malay and English normalization, read more at https://github.com/mesolitica/malaya/issues/247#issuecomment-3030313021. See the sketch below.
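A minimal hand-rolled sketch of digit-by-digit normalization, assuming a simple lookup table; the `MALAY_DIGITS` mapping and `normalize_digits` helper are illustrative only and are not Malaya's API:

```python
import re

# Illustrative digit-to-word table for Malay; swap in English words for English text.
MALAY_DIGITS = {
    '0': 'kosong', '1': 'satu', '2': 'dua', '3': 'tiga', '4': 'empat',
    '5': 'lima', '6': 'enam', '7': 'tujuh', '8': 'lapan', '9': 'sembilan',
}

def normalize_digits(text, mapping=MALAY_DIGITS):
    # Replace every run of digits with its spelled-out digits, e.g. '123' -> 'satu dua tiga'.
    return re.sub(r'\d+', lambda m: ' '.join(mapping[d] for d in m.group(0)), text)

print(normalize_digits('IC saya 960314'))
# IC saya sembilan enam kosong tiga satu empat
```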
2. The repetitive pronunciation dataset does not consistently use commas for pauses. For example, `A, A, A, A, B, B` in our recordings is spoken as `A A A A B B`. We do not plan to fix this due to cost, but continued finetuning on a properly annotated dataset should resolve it.

## Source code

Source code at https://github.com/mesolitica/malaya-speech/tree/master/session/qwen-tts

## Acknowledgement

Special thanks to https://www.sns.com.my and Nvidia for 1x H100!