HIT-TMG/UniMoE-Audio-Preview

+---
+license: mit
+language:
+- en
+- zh
+tags:
+- MoE
+- Unified Generation
+- Speech and Music
+- Multi-modal
+---
+<h1 align="center">UniMoE-Audio</h1>
+**UniMoE-Audio** is a unified framework that combines speech and music generation in a single model, powered by a novel Dynamic-Capacity Mixture-of-Experts (MoE) architecture.
+<div align="center" style="display: flex; justify-content: center; margin-top: 10px;">
+  <a href="https://mukioxun.github.io/Uni-MoE-site/home.html"><img src="https://img.shields.io/badge/📰 -Website-228B22" style="margin-right: 5px;"></a>
+  <a href="docs/UniMoE_Audio-Paper.pdf"><img src="https://img.shields.io/badge/📄-Paper-8A2BE2" style="margin-right: 5px;"></a>
+</div>
+## Model Information
+- **Base Model**: Qwen2.5-VL with MoE extensions
+- **Audio Codec**: DAC (Descript Audio Codec) with 12 channels
+- **Expert Configuration**: 8 routed experts + 2 shared experts + 1 null expert
+- **Audio Sampling Rate**: 16 kHz
+- **Usage**:
+  - Text-to-Speech (TTS)
+  - Music Generation
+- **GPU Requirements**:
+  - Memory: 16 GB+
+  - CUDA-enabled GPU
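To make the expert configuration above more concrete, here is a toy sketch of how a router could choose among the 8 routed experts and the null expert for a single token. The selection rule (keep the most probable experts until their cumulative probability reaches `top_p`), the threshold value, and all function names are assumptions for illustration only; this card does not specify the routing rule, and the sketch is not the model's actual implementation.

```python
import math

NUM_ROUTED = 8  # from the model card
NUM_NULL = 1    # from the model card; a null expert performs no computation
# The 2 shared experts run for every token, so they are not part of routing.


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_experts(router_logits, top_p=0.7):
    """Return (expert_index, probability) pairs chosen for one token.

    Assumed rule: take experts in order of decreasing probability until
    their cumulative probability reaches top_p. Index NUM_ROUTED is the
    null expert.
    """
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for idx in ranked:
        chosen.append((idx, probs[idx]))
        cumulative += probs[idx]
        if cumulative >= top_p:
            break
    return chosen


def active_routed_experts(chosen):
    """Drop the null expert: selecting it costs no computation."""
    return [idx for idx, _ in chosen if idx < NUM_ROUTED]


# A confident router activates one expert; an uncertain one activates several.
confident = select_experts([10.0] + [0.0] * (NUM_ROUTED + NUM_NULL - 1))
uncertain = select_experts([0.0] * (NUM_ROUTED + NUM_NULL))
print(len(confident), len(uncertain))
```

Under this assumed rule the number of active experts varies per token, which is one way to read "dynamic capacity"; a token routed only to the null expert would be processed by the shared experts alone.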
+## Open-source Plan
+- [x] Model Checkpoint
+    - [x] [UniMoE-Audio-preview](https://huggingface.co/foggyforest/UniMoE-Audio-preview)
+    - [ ] [UniMoE-Audio]()
+- [x] Training and Inference Code: [HITsz-TMG/UniMoE-Audio](https://github.com/HITsz-TMG/UMOE-Scaling-Unified-Multimodal-LLMs/tree/master/UniMoE-Audio)
+- [x] Technical Report: [UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE]()
+## Evaluation
+### Speech Synthesis
+![Speech Synthesis](./imgs/Speech_Generation.png)
+### Text to Music Generation
+![Text to Music Generation](./imgs/T2M.png)
+### Video-Text to Music Generation
+![Video-Text to Music Generation](./imgs/VT2M.png)
+## Requirements
+We recommend using conda to install the environment.
+```bash
+conda env create -f configs/enviroment.yml      # add -n <name> to use a custom environment name
+conda activate unimoe-audio                     # default environment name
+```
+A DAC model is also required in `/path/to/UniMoE-Audio/utils/dac_model`.
+It is downloaded automatically the first time the code runs.
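Before loading the model, it can help to confirm that the environment matches the GPU requirements listed under Model Information. The following is a minimal sketch, not part of the repository: the helper names are hypothetical, and the 16 GB threshold is applied in decimal gigabytes as a rough guide. Cards marketed as 16 GB may report slightly less, so treat a failure as a warning rather than a hard stop.

```python
import importlib.util

REQUIRED_GB = 16  # "Memory: 16GB+" from the model card


def enough_memory(total_bytes, required_gb=REQUIRED_GB):
    """Rough check of reported device memory against the recommendation."""
    return total_bytes >= required_gb * 10**9


def check_gpu():
    """Return "ok" or a short description of what is missing."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed in this environment"
    import torch  # imported lazily so the helper also runs without PyTorch

    if not torch.cuda.is_available():
        return "No CUDA-enabled GPU is visible to PyTorch"
    total = torch.cuda.get_device_properties(0).total_memory
    if not enough_memory(total):
        return (
            f"GPU 0 reports {total / 10**9:.1f} GB, "
            f"below the recommended {REQUIRED_GB} GB"
        )
    return "ok"


if __name__ == "__main__":
    print(check_gpu())
```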
+## Usage
+The following snippet shows how to load UniMoE-Audio with `transformers`:
+```python
+import os  # used to build the output paths in the examples below
+import torch
+import deepspeed_utils # This line is important, do not delete
+from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
+# Import from utils modules
+from utils import (
+    Dac,
+    preprocess_codec,
+    DecoderOutput,
+    tts_preprocess,
+    t2m_preprocess,
+    v2m_preprocess,
+    prepare_audio_prompt,
+    generate_output
+)
+model_path = "/path/to/your/model"
+dac = Dac()
+model = AutoModelForCausalLM.from_pretrained(
+    model_path,
+    torch_dtype=torch.float32,
+    attn_implementation='sdpa',
+    trust_remote_code=True,
+).eval()
+model = model.to('cuda')
+processor = AutoProcessor.from_pretrained(model_path)
+```
+### TTS Example:
+```python
+transcription = [
+    "The nature reserve covers only a small part of the marsh area.",
+    "我们基于动态容量混合专家框架，构建了一个统一语音和音乐生成模型。"
+]
+prompt_wav = "/path/to/your/voice/prompt"
+prompt_transcription = "content of your voice prompt"
+prompt_codec = preprocess_codec(model, dac.encode(prompt_wav))
+text_input, tts_generation_kwargs = tts_preprocess(transcription, prompt_codec, prompt_transcription, model.device)
+source_input = processor.tokenizer(text_input, add_special_tokens=False, return_tensors="pt", padding=True).to(model.device)
+prefill, prefill_steps = prepare_audio_prompt(model, audio_prompts=[None] * len(transcription))
+dec_output = DecoderOutput(prefill, prefill_steps, model.device)
+with torch.no_grad():
+    generated_codes, lengths_Bx = model.generate(
+        input_ids=source_input.input_ids,
+        attention_mask=source_input.attention_mask,
+        dec_output=dec_output,
+        max_tokens=10 * 50, # maximum duration of the generated audio is 10 seconds
+        min_tokens=1 * 50, # minimum duration of the generated audio is 1 second
+        temperature=1.0,
+        top_p=1.0,
+        cfg_filter_top_k=45,
+        do_sample=True,
+        use_cache=True,
+        **tts_generation_kwargs
+    )
+audios = generate_output(model, generated_codes, lengths_Bx)
+for i in range(len(audios)):
+    output_path = os.path.join(f"./generated_speech_{i}.wav")
+    dac.decode(audios[i].transpose(0, 1).unsqueeze(0), save_path=output_path, min_duration=1)
+```
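The `max_tokens` and `min_tokens` values in the example above are written as seconds multiplied by 50, which suggests a rate of 50 generated tokens per second of audio. A small helper can make that conversion explicit. This is a sketch: the function names are hypothetical and the 50 tokens-per-second rate is inferred from the example's comments rather than stated by the model card.

```python
TOKENS_PER_SECOND = 50  # inferred from `max_tokens=10 * 50` (10 seconds)


def duration_to_tokens(seconds):
    """Convert an audio duration in seconds to a token count."""
    if seconds <= 0:
        raise ValueError("duration must be positive")
    return int(round(seconds * TOKENS_PER_SECOND))


def token_budget(min_seconds, max_seconds):
    """Build the min_tokens/max_tokens keyword arguments for generation."""
    if min_seconds > max_seconds:
        raise ValueError("min_seconds must not exceed max_seconds")
    return {
        "min_tokens": duration_to_tokens(min_seconds),
        "max_tokens": duration_to_tokens(max_seconds),
    }


# Speech between 1 and 10 seconds, matching the TTS example.
print(token_budget(1, 10))
```

The resulting dictionary could be unpacked into the call, for example `model.generate(..., **token_budget(1, 10))`, in place of the hand-written `min_tokens` and `max_tokens` arguments.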
+### T2M Example:
+```python
+caption = [
+    "A retro-inspired synthwave track with a driving beat and nostalgic melodies. Perfect for cruising or late-night drives.",
+    "A mid-tempo electronic track with a driving beat and atmospheric synth textures. Ideal for background listening or a chill dance set."
+]
+text_input, t2m_generation_kwargs = t2m_preprocess(caption)
+source_input = processor.tokenizer(text_input, add_special_tokens=False, return_tensors="pt", padding=True).to(model.device)
+prefill, prefill_steps = prepare_audio_prompt(model, audio_prompts=[None] * len(caption))
+dec_output = DecoderOutput(prefill, prefill_steps, model.device)
+with torch.no_grad():
+    generated_codes, lengths_Bx = model.generate(
+        input_ids=source_input.input_ids,
+        attention_mask=source_input.attention_mask,
+        dec_output=dec_output,
+        max_tokens=20 * 50, # maximum duration of the generated audio is 20 seconds
+        min_tokens=8 * 50, # minimum duration of the generated audio is 8 seconds
+        temperature=1.0,
+        top_p=1.0,
+        cfg_filter_top_k=45,
+        do_sample=True,
+        use_cache=True,
+        **t2m_generation_kwargs
+    )
+audios = generate_output(model, generated_codes, lengths_Bx)
+for i in range(len(audios)):
+    output_path = os.path.join(f"./generated_music_{i}.wav")
+    dac.decode(audios[i].transpose(0, 1).unsqueeze(0), save_path=output_path, min_duration=1)
+```
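The `temperature`, `top_p`, and `do_sample` arguments used above follow the usual sampling conventions. The sketch below illustrates, in plain Python, how temperature scaling, top-k filtering, and top-p (nucleus) filtering are commonly combined when picking one token from a list of logits. It is a generic illustration with hypothetical names, not the model's own decoding code; in particular, the exact behavior of `cfg_filter_top_k` is not documented in this card, so the `top_k` argument here shows only an ordinary top-k filter.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_p=1.0, top_k=None, rng=random):
    """Sample one token index from raw logits.

    temperature < 1 sharpens the distribution, > 1 flattens it.
    top_k keeps only the k highest-scoring tokens.
    top_p keeps the smallest set of tokens whose probability mass reaches top_p.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]

    # Softmax over the surviving candidates (stable: subtract the maximum).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Nucleus filtering: stop once the cumulative probability reaches top_p.
    kept_indices, kept_probs, cumulative = [], [], 0.0
    for idx, p in zip(ranked, probs):
        kept_indices.append(idx)
        kept_probs.append(p)
        cumulative += p
        if cumulative >= top_p:
            break

    # random.choices renormalizes the remaining weights internally.
    return rng.choices(kept_indices, weights=kept_probs, k=1)[0]


# With top_p=1.0 and temperature=1.0 (the values used above), every token
# that survives the top-k filter can be drawn in proportion to its probability.
print(sample_token([0.5, 1.5, -0.2], temperature=1.0, top_p=1.0, top_k=45))
```

Lowering `top_p` or `temperature` in the generation call generally trades diversity for more predictable output.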
+### V2M Example:
+```python
+caption = [
+    "A relaxing instrumental piece featuring a simple melody played on a synth flute. The track creates a calm and peaceful atmosphere.",
+]
+video = [
+    "/path/to/your/video/path.mp4",
+]
+text_input, video_inputs, fps_inputs, v2m_generation_kwargs = v2m_preprocess(caption, video)
+source_input = processor(text=text_input, images=None, videos=video_inputs, fps=fps_inputs, padding=True, return_tensors="pt", do_resize=False)
+source_input = source_input.to(model.device)
+prefill, prefill_steps = prepare_audio_prompt(model, audio_prompts=[None] * len(caption))
+dec_output = DecoderOutput(prefill, prefill_steps, model.device)
+with torch.no_grad():
+    generated_codes, lengths_Bx = model.generate(
+        input_ids=source_input.input_ids,
+        pixel_values_videos=source_input.pixel_values_videos,
+        video_grid_thw=source_input.video_grid_thw,
+        second_per_grid_ts=source_input.second_per_grid_ts,
+        attention_mask=source_input.attention_mask,
+        dec_output=dec_output,
+        max_tokens=20 * 50, # maximum duration of the generated audio is 20 seconds
+        min_tokens=8 * 50, # minimum duration of the generated audio is 8 seconds
+        temperature=1.0,
+        top_p=1.0,
+        cfg_filter_top_k=45,
+        do_sample=True,
+        use_cache=True,
+        **v2m_generation_kwargs
+    )
+audios = generate_output(model, generated_codes, lengths_Bx)
+for i in range(len(audios)):
+    output_path = os.path.join(f"./generated_video_music_{i}.wav")
+    dac.decode(audios[i].transpose(0, 1).unsqueeze(0), save_path=output_path, min_duration=1)
+```