foggyforest committed on
Commit
2713bc4
·
verified ·
1 Parent(s): 1aed72e

Create README.md

Files changed (1)
  1. README.md +221 -0
README.md ADDED
---
license: mit
language:
- en
- zh
tags:
- MoE
- Unified Generation
- Speech and Music
- Multi-modal
---

<h1 align="center">UniMoE-Audio</h1>

**UniMoE-Audio** is a unified framework that seamlessly combines speech and music generation, powered by a novel Dynamic-Capacity Mixture-of-Experts architecture.

<div align="center" style="display: flex; justify-content: center; margin-top: 10px;">
  <a href="https://mukioxun.github.io/Uni-MoE-site/home.html"><img src="https://img.shields.io/badge/📰 -Website-228B22" style="margin-right: 5px;"></a>
  <a href="docs/UniMoE_Audio-Paper.pdf"><img src="https://img.shields.io/badge/📄-Paper-8A2BE2" style="margin-right: 5px;"></a>
</div>

## Model Information
- **Base Model**: Qwen2.5-VL with MoE extensions
- **Audio Codec**: DAC (Descript Audio Codec) with 12 channels
- **Expert Configuration**: 8 routed experts + 2 shared experts + 1 null expert (see the illustrative routing sketch below)
- **Audio Sampling Rate**: 16 kHz
- **Usage**:
  - Text-to-Speech (TTS)
  - Music Generation
- **GPU Requirements**:
  - Memory: 16GB+
  - CUDA-enabled GPU
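
For intuition only, here is a minimal toy sketch of how routing with a null expert can work. The expert counts match the configuration above, but the code is an illustrative assumption, not the repository's implementation.

```python
import torch
import torch.nn.functional as F

# Toy sketch (not the actual UniMoE-Audio code): a router scores 8 routed
# experts plus 1 null expert; tokens whose best choice is the null expert
# skip expert computation, while the 2 shared experts would see every token.
NUM_ROUTED, NUM_NULL = 8, 1
hidden = torch.randn(16, 64)                        # (num_tokens, hidden_dim)
router = torch.nn.Linear(64, NUM_ROUTED + NUM_NULL)

probs = F.softmax(router(hidden), dim=-1)           # (num_tokens, 9)
top_idx = probs.argmax(dim=-1)                      # best expert per token

routed_mask = top_idx < NUM_ROUTED                  # tokens that use a real expert
print(f"{routed_mask.sum().item()}/{hidden.size(0)} tokens routed to experts, "
      f"{(~routed_mask).sum().item()} dropped to the null expert")
```
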
## Open-source Plan
- [x] Model Checkpoint
  - [x] [UniMoE-Audio-preview](https://huggingface.co/foggyforest/UniMoE-Audio-preview)
  - [ ] [UniMoE-Audio]()
- [x] Training and Inference Code: [HITsz-TMG/UniMoE-Audio](https://github.com/HITsz-TMG/UMOE-Scaling-Unified-Multimodal-LLMs/tree/master/UniMoE-Audio)
- [x] Technical Report: [UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE]()

## Evaluation
### Speech Synthesis
![Speech Synthesis](./imgs/Speech_Generation.png)
### Text to Music Generation
![Text to Music Generation](./imgs/T2M.png)
### Video-Text to Music Generation
![Video-Text to Music Generation](./imgs/VT2M.png)

## Requirements
We recommend using conda to install the environment.
```bash
conda env create -f configs/enviroment.yml  # add -n <env_name> to use a custom environment name
conda activate unimoe-audio                 # default name
```
A DAC (Descript Audio Codec) model is also required under `/path/to/UniMoE-Audio/utils/dac_model`; it is downloaded automatically the first time you run the code.

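If you prefer to fetch the DAC weights ahead of time, a minimal sketch using the public `descript-audio-codec` package could look like the following; the `16khz` model type and the target directory layout are assumptions, so adjust them to whatever the `Dac` wrapper in `utils` expects.

```python
import shutil
from pathlib import Path

import dac  # pip install descript-audio-codec

# Assumption: the 16khz DAC variant matches the 16 kHz sampling rate above.
downloaded_path = dac.utils.download(model_type="16khz")

# Assumption: the Dac() wrapper looks for weights under utils/dac_model.
target_dir = Path("/path/to/UniMoE-Audio/utils/dac_model")
target_dir.mkdir(parents=True, exist_ok=True)
shutil.copy(downloaded_path, target_dir / Path(downloaded_path).name)
print(f"DAC weights copied to {target_dir}")
```
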
## Usage

Here is a code snippet showing how to load UniMoE-Audio with `transformers`:

```python
import os
import torch
import deepspeed_utils  # This line is important, do not delete
from transformers import AutoModelForCausalLM, AutoProcessor

# Import helpers from the utils modules shipped with the repository
from utils import (
    Dac,
    preprocess_codec,
    DecoderOutput,
    tts_preprocess,
    t2m_preprocess,
    v2m_preprocess,
    prepare_audio_prompt,
    generate_output,
)

model_path = "/path/to/your/model"

# DAC codec used to encode voice prompts and decode generated audio tokens
dac = Dac()

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float32,
    attn_implementation='sdpa',
    trust_remote_code=True,
).eval()
model = model.to('cuda')

processor = AutoProcessor.from_pretrained(model_path)
```

### TTS Example:

```python
transcription = [
    "The nature reserve covers only a small part of the marsh area.",
    "我们基于动态容量混合专家框架,构建了一个统一语音和音乐生成模型。"
]
prompt_wav = "/path/to/your/voice/prompt"
prompt_transcription = "content of your voice prompt"

# Encode the voice prompt into DAC codec tokens
prompt_codec = preprocess_codec(model, dac.encode(prompt_wav))
text_input, tts_generation_kwargs = tts_preprocess(transcription, prompt_codec, prompt_transcription, model.device)
source_input = processor.tokenizer(text_input, add_special_tokens=False, return_tensors="pt", padding=True).to(model.device)

prefill, prefill_steps = prepare_audio_prompt(model, audio_prompts=[None] * len(transcription))
dec_output = DecoderOutput(prefill, prefill_steps, model.device)

with torch.no_grad():
    generated_codes, lengths_Bx = model.generate(
        input_ids=source_input.input_ids,
        attention_mask=source_input.attention_mask,
        dec_output=dec_output,
        max_tokens=10 * 50,  # maximum duration of the generated audio is 10 seconds
        min_tokens=1 * 50,   # minimum duration of the generated audio is 1 second
        temperature=1.0,
        top_p=1.0,
        cfg_filter_top_k=45,
        do_sample=True,
        use_cache=True,
        **tts_generation_kwargs
    )

audios = generate_output(model, generated_codes, lengths_Bx)
for i in range(len(audios)):
    output_path = os.path.join(f"./generated_speech_{i}.wav")
    dac.decode(audios[i].transpose(0, 1).unsqueeze(0), save_path=output_path, min_duration=1)
```
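
The `max_tokens` / `min_tokens` comments above imply roughly 50 codec frames per generated second of audio. Assuming that rate, a small convenience helper (not part of the released API) can convert a target duration into a token budget:

```python
# Assumption: ~50 codec frames per generated second, per the comments above.
FRAMES_PER_SECOND = 50

def duration_to_tokens(seconds: float) -> int:
    """Convert a target audio duration into a token budget for model.generate."""
    return int(seconds * FRAMES_PER_SECOND)

# e.g. max_tokens=duration_to_tokens(10) reproduces the 10 * 50 setting above
```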

### T2M Example:
```python
caption = [
    "A retro-inspired synthwave track with a driving beat and nostalgic melodies. Perfect for cruising or late-night drives.",
    "A mid-tempo electronic track with a driving beat and atmospheric synth textures. Ideal for background listening or a chill dance set."
]

text_input, t2m_generation_kwargs = t2m_preprocess(caption)

source_input = processor.tokenizer(text_input, add_special_tokens=False, return_tensors="pt", padding=True).to(model.device)

prefill, prefill_steps = prepare_audio_prompt(model, audio_prompts=[None] * len(caption))
dec_output = DecoderOutput(prefill, prefill_steps, model.device)

with torch.no_grad():
    generated_codes, lengths_Bx = model.generate(
        input_ids=source_input.input_ids,
        attention_mask=source_input.attention_mask,
        dec_output=dec_output,
        max_tokens=20 * 50,  # maximum duration of the generated audio is 20 seconds
        min_tokens=8 * 50,   # minimum duration of the generated audio is 8 seconds
        temperature=1.0,
        top_p=1.0,
        cfg_filter_top_k=45,
        do_sample=True,
        use_cache=True,
        **t2m_generation_kwargs
    )

audios = generate_output(model, generated_codes, lengths_Bx)
for i in range(len(audios)):
    output_path = os.path.join(f"./generated_music_{i}.wav")
    dac.decode(audios[i].transpose(0, 1).unsqueeze(0), save_path=output_path, min_duration=1)
```

### V2M Example:
```python
caption = [
    "A relaxing instrumental piece featuring a simple melody played on a synth flute. The track creates a calm and peaceful atmosphere.",
]
video = [
    "/path/to/your/video/path.mp4",
]

text_input, video_inputs, fps_inputs, v2m_generation_kwargs = v2m_preprocess(caption, video)

source_input = processor(text=text_input, images=None, videos=video_inputs, fps=fps_inputs, padding=True, return_tensors="pt", do_resize=False)
source_input = source_input.to(model.device)

prefill, prefill_steps = prepare_audio_prompt(model, audio_prompts=[None] * len(caption))
dec_output = DecoderOutput(prefill, prefill_steps, model.device)

with torch.no_grad():
    generated_codes, lengths_Bx = model.generate(
        input_ids=source_input.input_ids,
        pixel_values_videos=source_input.pixel_values_videos,
        video_grid_thw=source_input.video_grid_thw,
        second_per_grid_ts=source_input.second_per_grid_ts,
        attention_mask=source_input.attention_mask,
        dec_output=dec_output,
        max_tokens=20 * 50,  # maximum duration of the generated audio is 20 seconds
        min_tokens=8 * 50,   # minimum duration of the generated audio is 8 seconds
        temperature=1.0,
        top_p=1.0,
        cfg_filter_top_k=45,
        do_sample=True,
        use_cache=True,
        **v2m_generation_kwargs
    )

audios = generate_output(model, generated_codes, lengths_Bx)
for i in range(len(audios)):
    output_path = os.path.join(f"./generated_video_music_{i}.wav")
    dac.decode(audios[i].transpose(0, 1).unsqueeze(0), save_path=output_path, min_duration=1)
```