Audio-to-Audio · Safetensors · speech · audio · tokenizer

Aratako committed · Commit f05edfa · verified · 1 Parent(s): 7a12088

Add files using upload-large-folder tool

Files changed (4):
  1. README.md +158 -0
  2. config.yaml +99 -0
  3. model.safetensors +3 -0
  4. vocoder_config.json +58 -0
README.md ADDED
---
license: mit
language:
- en
- ja
- nl
- fr
- de
- it
- pl
- pt
- es
tags:
- speech
- audio
- tokenizer
datasets:
- sarulab-speech/mls_sidon
- mythicinfinity/Libriheavy-HQ
- nvidia/hifitts-2
pipeline_tag: audio-to-audio
---

# MioCodec: High-Fidelity 44.1kHz Neural Audio Codec for Efficient Spoken Language Modeling

[![GitHub](https://img.shields.io/badge/Code-GitHub-black)](https://github.com/Aratako/MioCodec)

**MioCodec-25Hz** is a high-fidelity neural audio codec designed for efficient spoken language modeling. Based on the [Kanade-Tokenizer](https://github.com/frothywater/kanade-tokenizer) implementation, MioCodec extends its capabilities to a 44.1 kHz sampling rate, delivering higher audio quality while maintaining a very low token rate.

## 🌟 Overview

MioCodec decomposes speech into two distinct components:

1. **Content Tokens:** Discrete representations that primarily capture linguistic information and phonetic content ("what" is being said) at a low frame rate (25 Hz).
2. **Global Embeddings:** A continuous vector representing broad acoustic characteristics ("how" it is said), including speaker identity, recording environment, and microphone traits.

By disentangling these elements, MioCodec is ideal for **Spoken Language Modeling**: a language model can operate on the low-rate content tokens alone, while the global embedding restores acoustic detail at synthesis time.

### Key features

* **High-Resolution:** Supports **44.1 kHz** audio (compared to the standard 24 kHz in Kanade).
* **Ultra-Low Bitrate:** Achieves high-fidelity reconstruction at only **341 bps**.
* **End-to-End Optimization:** Unlike the original two-stage approach, the codec and vocoder are jointly fine-tuned to minimize waveform artifacts and jitter.

## 📊 Model Comparison

| Model | Token Rate | Vocab Size | Bitrate | Sample Rate | SSL Encoder | Vocoder | Parameters |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **MioCodec-25Hz** | **25 Hz** | **12,800** | **341 bps** | **44.1 kHz** | **WavLM-base+** | **[MioVocoder](https://huggingface.co/Aratako/MioVocoder)** (jointly tuned) | **118M** |
| kanade-25hz | 25 Hz | 12,800 | 341 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 118M |
| kanade-12.5hz | 12.5 Hz | 12,800 | 171 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 120M |
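The bitrates above follow directly from the token rate and the FSQ codebook size (the `levels: [8, 8, 8, 5, 5]` in `config.yaml`); a quick arithmetic check:

```python
import math

# FSQ levels [8, 8, 8, 5, 5] give a codebook of 8*8*8*5*5 codes.
vocab_size = 8 * 8 * 8 * 5 * 5
bits_per_token = math.log2(vocab_size)  # ~13.64 bits per token

print(vocab_size)                    # 12800
print(round(25 * bits_per_token))    # 341 bps at 25 Hz
print(round(12.5 * bits_per_token))  # 171 bps at 12.5 Hz
```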

## 🚀 Quick Start

### Installation

```bash
# Install via pip
pip install git+https://github.com/Aratako/MioCodec

# Or using uv
uv add git+https://github.com/Aratako/MioCodec
```

### Basic Inference

Basic usage for encoding and decoding audio:

```python
import soundfile as sf

from miocodec import MioCodec, load_audio

# 1. Load the model
model = MioCodec.from_pretrained("Aratako/MioCodec-25Hz").eval().cuda()

# 2. Load audio at the model's sample rate (44.1 kHz)
waveform = load_audio("input.wav", sample_rate=model.config.sample_rate).cuda()

# 3. Encode to content tokens and a global embedding
features = model.encode(waveform)

# 4. Decode back to a waveform
resynth = model.decode(features=features)

# 5. Save the reconstruction
sf.write("output.wav", resynth.cpu().numpy(), samplerate=model.config.sample_rate)
```

### Voice Conversion (Zero-shot)

MioCodec can swap speaker identities by combining the content tokens of a source utterance with the global embedding of a reference speaker:

```python
source = load_audio("source_content.wav", sample_rate=model.config.sample_rate).cuda()
reference = load_audio("target_speaker.wav", sample_rate=model.config.sample_rate).cuda()

# Content from `source`, speaker identity from `reference`
vc_wave = model.voice_conversion(source, reference)
sf.write("converted.wav", vc_wave.cpu().numpy(), samplerate=model.config.sample_rate)
```

## 🏗️ Training Methodology

To achieve high-fidelity 44.1 kHz reconstruction, MioCodec was trained in three phases. Phases 1 and 2 strictly follow the [original Kanade paper](https://openreview.net/pdf?id=dNUcKJEPTh) to establish feature alignment and spectral sharpness, while Phase 3 introduces a novel end-to-end waveform refinement stage.

### Phase 1: Feature Alignment
This phase corresponds to the "Main Training Phase" described in the original paper. The model is trained to minimize both a **mel-spectrogram loss** and an **SSL feature reconstruction loss** (using WavLM-base+). The vocoder is not used; the loss is computed directly on the predicted mel spectrograms.

### Phase 2: Adversarial Alignment
Following the "GAN Post-Training" phase of the original paper, adversarial training is introduced to sharpen the spectrograms. In this stage the content branch is frozen, and only the decoder and global branch are updated. The model is trained with the **mel-spectrogram loss** combined with **GAN losses** (adversarial + feature matching) applied in the mel domain.

### Phase 3: End-to-End Waveform Refinement
To address residual artifacts such as jitter or tremor, which are common in mel-only training, a third phase shifts the training domain to raw waveforms.

In this phase the vocoder is unfrozen, allowing the codec decoder and vocoder to be fine-tuned jointly in an end-to-end manner. As in Phase 2, the content branch remains frozen. The training objective minimizes waveform artifacts using objectives adapted from [XCodec2](https://github.com/zhenye234/X-Codec-2.0) and [Inworld TTS-1](https://arxiv.org/html/2507.21138v1), with specific parameters tuned for 44.1 kHz:

* **Multi-Resolution Mel Spectrogram Loss:** Using window lengths of `[32, 64, 128, 256, 512, 1024, 2048, 4096]`.
* **Multi-Period Discriminator (MPD):** Using periods of `[2, 3, 5, 7, 11, 17, 23, 37]`.
* **Multi-Scale STFT Discriminator (MS-STFTD):** Using FFT sizes of `[216, 348, 568, 920, 1494, 2414, 3908, 6328]`.
* **RMS Loss:** Adopted from Inworld TTS-1 to stabilize energy and volume.
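The multi-resolution idea can be illustrated with a minimal NumPy sketch: an L1 distance between magnitude spectrograms computed at several window lengths. This is only an illustration of the principle; the actual Phase 3 loss operates on mel spectrograms inside the training framework, and `stft_mag` / `multi_res_spectral_loss` are names invented here.

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    # Frame the signal, apply a Hann window, and take the magnitude
    # of the real FFT of each frame.
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=-1))

def multi_res_spectral_loss(
    pred, target, resolutions=(32, 64, 128, 256, 512, 1024, 2048, 4096)
):
    # L1 distance between magnitude spectrograms at each window length,
    # averaged over resolutions (hop = window // 4, a common choice).
    total = 0.0
    for n_fft in resolutions:
        if len(pred) < n_fft:
            continue  # skip resolutions longer than the signal
        hop = n_fft // 4
        total += np.mean(np.abs(stft_mag(pred, n_fft, hop) - stft_mag(target, n_fft, hop)))
    return total / len(resolutions)
```

Short windows catch transients and jitter, long windows catch tonal structure; summing over both keeps the model honest at every time scale.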

## 📚 Training Data

The training datasets are listed below:

| Language | Approx. Hours | Dataset | Used in Phases |
| :--- | :--- | :--- | :---: |
| **Japanese** | ~15,000h | Various public HF datasets | 1, 2, 3 |
| **English** | ~500h | [Libriheavy-HQ](https://huggingface.co/datasets/mythicinfinity/Libriheavy-HQ) | 1, 2, 3 |
| **English** | ~4,000h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) | 1, 2 |
| **English** | ~9,000h | [HiFiTTS-2](https://huggingface.co/datasets/nvidia/hifitts-2) | 3 |
| **German** | ~1,950h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) | 1, 2 |
| **Dutch** | ~1,550h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) | 1, 2 |
| **French** | ~1,050h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) | 1, 2 |
| **Spanish** | ~900h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) | 1, 2 |
| **Italian** | ~240h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) | 1, 2 |
| **Portuguese** | ~160h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) | 1, 2 |
| **Polish** | ~100h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) | 1, 2 |

## 📜 Acknowledgements

* **Codec Architecture:** Based on the brilliant work of [kanade-tokenizer](https://github.com/frothywater/kanade-tokenizer).
* **Vocoder Base:** Weights and codebase derived from [AliasingFreeNeuralAudioSynthesis](https://github.com/sizigi/AliasingFreeNeuralAudioSynthesis).
* **Training Techniques:** Phase 3 training objectives were heavily inspired by [XCodec2](https://github.com/zhenye234/X-Codec-2.0) and [Inworld TTS-1](https://arxiv.org/html/2507.21138v1).

## 🖊️ Citation

```bibtex
@misc{miocodec-25hz,
  author       = {Chihiro Arata},
  title        = {MioCodec: High-Fidelity 44.1kHz Neural Audio Codec},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/MioCodec-25Hz}}
}
```
config.yaml ADDED
model:
  class_path: miocodec.model.MioCodecModel
  init_args:
    config:
      # SSL feature settings
      local_ssl_layers: [6, 9]
      global_ssl_layers: [1, 2]
      normalize_ssl_features: true

      # Down/up-sampling settings
      downsample_factor: 2
      mel_upsample_factor: 4
      use_conv_downsample: true
      mel_interpolation_mode: linear

      # Audio settings (match Pupu-Vocoder mel specs)
      sample_rate: 44100
      n_fft: 2048
      hop_length: 512
      n_mels: 128
      padding: same
      mel_backend: pupu
      mel_fmin: 0.0
      mel_fmax: 22050.0
      mel_win_length: 2048

    ssl_feature_extractor:
      class_path: miocodec.module.ssl_extractor.SSLFeatureExtractor
      init_args:
        model_name: wavlm_base_plus
        output_layer: 9      # Use at most 9 layers
        sample_rate: 44100   # Consistent with the target sample rate for reconstruction

    local_encoder:
      class_path: miocodec.module.transformer.Transformer
      init_args:
        dim: 768
        n_layers: 6
        n_heads: 12
        window_size: 125
        use_rope: true
        rope_theta: 10000.0
        max_seq_len: 512
        use_flash_attention: true

    local_quantizer:
      class_path: miocodec.module.fsq.FiniteScalarQuantizer
      init_args:
        input_dim: 768           # Must match local encoder output dimension
        output_dim: 768          # Must match feature decoder input dimension
        levels: [8, 8, 8, 5, 5]  # 8*8*8*5*5 = 12,800 codes

    feature_decoder: null

    global_encoder:
      class_path: miocodec.module.global_encoder.GlobalEncoder
      init_args:
        input_channels: 768      # WavLM-base+ feature dimension
        output_channels: 128
        num_layers: 4
        dim: 384
        intermediate_dim: 1152

    mel_prenet:
      class_path: miocodec.module.transformer.Transformer
      init_args:
        dim: 768
        output_dim: 512
        n_layers: 6
        n_heads: 12
        window_size: 65
        use_rope: true
        rope_theta: 10000.0
        max_seq_len: 512
        use_flash_attention: true

    mel_decoder:
      class_path: miocodec.module.transformer.Transformer
      init_args:
        dim: 512
        output_dim: 128             # Number of mel frequency bins
        n_layers: 6
        n_heads: 8
        window_size: 65
        use_rope: true
        rope_theta: 10000.0
        max_seq_len: 512
        adanorm_condition_dim: 128  # Must match global encoder output dimension
        use_adaln_zero: true        # Use AdaLN-Zero for conditioning
        use_flash_attention: true

    mel_postnet:
      class_path: miocodec.module.postnet.PostNet
      init_args:
        input_channels: 128  # Number of mel frequency bins
        channels: 256
        kernel_size: 7
        num_layers: 4
        use_layer_norm: true
model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:e523449750ad0c9f66c08a5ece0ea60291bb8e615abcc0c46577f56a1c2beb62
size 532367664
vocoder_config.json ADDED
{
  "base_config": "egs/exp_config_pupuvocoder_base.json",
  "model_type": "PupuVocoder",
  "model": {
    "generator": "pupuvocoder",
    "pupuvocoder": {
      "resblock": "1",
      "upsample_rates": [8, 8, 2, 2, 2],
      "upsample_kernel_sizes": [16, 16, 4, 4, 4],
      "upsample_initial_channel": 512,
      "resblock_kernel_sizes": [3, 7, 11],
      "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
    }
  },
  "train": {
    "criterions": ["feature", "discriminator", "generator", "multimel"]
  },
  "inference": {
    "batch_size": 1
  }
}
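A consistency check worth noting: the product of the generator's `upsample_rates` must equal the codec's `hop_length` (512 in `config.yaml`), so that each mel frame expands to exactly one hop of 44.1 kHz audio:

```python
import math

upsample_rates = [8, 8, 2, 2, 2]
total_upsampling = math.prod(upsample_rates)
print(total_upsampling)  # 512, matching hop_length: 512

# One mel frame therefore covers 512 / 44100 s of audio (~11.6 ms),
# i.e. the vocoder consumes mel frames at 44100 / 512 ≈ 86.1 frames/s.
print(round(44100 / 512, 1))  # 86.1
```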