Add files using upload-large-folder tool

- README.md +158 -0
- config.yaml +99 -0
- model.safetensors +3 -0
- vocoder_config.json +58 -0

README.md (ADDED)
---
license: mit
language:
- en
- ja
- nl
- fr
- de
- it
- pl
- pt
- es
tags:
- speech
- audio
- tokenizer
datasets:
- sarulab-speech/mls_sidon
- mythicinfinity/Libriheavy-HQ
- nvidia/hifitts-2
pipeline_tag: audio-to-audio
---

# MioCodec: High-Fidelity 44.1kHz Neural Audio Codec for Efficient Spoken Language Modeling

[GitHub](https://github.com/Aratako/MioCodec)

**MioCodec-25Hz** is a high-fidelity neural audio codec designed for efficient spoken language modeling. Based on the [Kanade-Tokenizer](https://github.com/frothywater/kanade-tokenizer) implementation, MioCodec extends it to a 44.1 kHz sampling rate, delivering superior audio quality while maintaining a very low token rate.

## 🌟 Overview

MioCodec decomposes speech into two distinct components:

1. **Content Tokens:** Discrete representations that primarily capture linguistic information and phonetic content ("what" is being said) at a low frame rate (25 Hz).
2. **Global Embeddings:** A continuous vector representing broad acoustic characteristics ("how" it is said), including speaker identity, recording environment, and microphone traits.

By disentangling these elements, MioCodec is well suited to **spoken language modeling**.

### Key features

* **High resolution:** Supports **44.1 kHz** audio (compared to the standard 24 kHz in Kanade).
* **Ultra-low bitrate:** Achieves high-fidelity reconstruction at only **341 bps**.
* **End-to-end optimization:** Unlike the original two-stage approach, the codec and vocoder are jointly fine-tuned to minimize waveform artifacts and jitter.

## 📊 Model Comparison

| Model | Token Rate | Vocab Size | Bit Rate | Sample Rate | SSL Encoder | Vocoder | Parameters |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **MioCodec-25Hz** | **25 Hz** | **12,800** | **341 bps** | **44.1 kHz** | **WavLM-base+** | **[MioVocoder](https://huggingface.co/Aratako/MioVocoder)** (jointly tuned) | **118M** |
| kanade-25hz | 25 Hz | 12,800 | 341 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 118M |
| kanade-12.5hz | 12.5 Hz | 12,800 | 171 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 120M |
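The bit rates in the table follow directly from the token rate and vocabulary size: each token carries log2(12,800) ≈ 13.6 bits, so 25 tokens per second works out to roughly 341 bps (the continuous global embedding is a single per-utterance vector and is not counted in this steady-state rate). A quick sanity check:

```python
import math

token_rate_hz = 25     # content tokens emitted per second
vocab_size = 12_800    # codebook size

bits_per_token = math.log2(vocab_size)       # ~13.64 bits per token
bitrate_bps = token_rate_hz * bits_per_token

print(round(bitrate_bps))  # → 341
```

The same arithmetic reproduces the kanade-12.5hz row: 12.5 Hz × 13.64 bits ≈ 171 bps.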

## 🚀 Quick Start

### Installation

```bash
# Install via pip
pip install git+https://github.com/Aratako/MioCodec

# Or using uv
uv add git+https://github.com/Aratako/MioCodec
```

### Basic Inference

Basic usage for encoding and decoding audio:

```python
from miocodec import MioCodec, load_audio
import soundfile as sf

# 1. Load the model
model = MioCodec.from_pretrained("Aratako/MioCodec-25Hz").eval().cuda()

# 2. Load audio at the model's sample rate
waveform = load_audio("input.wav", sample_rate=model.config.sample_rate).cuda()

# 3. Encode audio into content tokens and a global embedding
features = model.encode(waveform)

# 4. Decode back to a waveform
resynth = model.decode(features=features)

# 5. Save the reconstruction
sf.write("output.wav", resynth.cpu().numpy(), samplerate=model.config.sample_rate)
```

### Voice Conversion (Zero-shot)

MioCodec lets you swap speaker identities by combining the content tokens of a source utterance with the global embedding of a reference:

```python
source = load_audio("source_content.wav", sample_rate=model.config.sample_rate).cuda()
reference = load_audio("target_speaker.wav", sample_rate=model.config.sample_rate).cuda()

# Perform conversion
vc_wave = model.voice_conversion(source, reference)
sf.write("converted.wav", vc_wave.cpu().numpy(), samplerate=model.config.sample_rate)
```

## 🏗️ Training Methodology

To achieve high-fidelity 44.1 kHz reconstruction, MioCodec was trained in three phases. Phases 1 and 2 strictly follow the [original Kanade paper](https://openreview.net/pdf?id=dNUcKJEPTh) to establish feature alignment and spectral sharpness, while Phase 3 introduces a novel end-to-end waveform refinement stage.

### Phase 1: Feature Alignment

This phase corresponds to the "Main Training Phase" described in the original paper. The model is trained to minimize both **mel-spectrogram loss** and **SSL feature reconstruction loss** (using WavLM-base+). The vocoder is not used; the loss is computed directly on the predicted mel-spectrograms.

### Phase 2: Adversarial Alignment

Following the "GAN Post-Training" phase of the original paper, we introduce adversarial training to sharpen the spectrograms. In this stage, the content branch is frozen, and only the decoder and global branch are updated. The model is trained with **mel-spectrogram loss** combined with **GAN losses** (adversarial + feature matching) applied in the mel domain.

### Phase 3: End-to-End Waveform Refinement

To address residual artifacts such as jitter or tremor often found in mel-only training, we introduce a third phase that shifts the training domain to raw waveforms.

In this phase, the vocoder is unfrozen, allowing the codec decoder and vocoder to be fine-tuned jointly in an end-to-end manner. As in Phase 2, the content branch remains frozen. The training objective minimizes waveform artifacts using objectives adapted from [XCodec2](https://github.com/zhenye234/X-Codec-2.0) and [Inworld TTS-1](https://arxiv.org/html/2507.21138v1), with parameters tuned for 44.1 kHz:

* **Multi-Resolution Mel Spectrogram Loss:** Window lengths of `[32, 64, 128, 256, 512, 1024, 2048, 4096]`.
* **Multi-Period Discriminator (MPD):** Periods of `[2, 3, 5, 7, 11, 17, 23, 37]`.
* **Multi-Scale STFT Discriminator (MS-STFTD):** FFT sizes of `[216, 348, 568, 920, 1494, 2414, 3908, 6328]`.
* **RMS Loss:** Adopted from Inworld TTS-1 to stabilize energy and volume.
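To illustrate the multi-resolution idea, the sketch below computes an L1 distance between STFT magnitudes at each of the window lengths listed above. This is a simplified stand-in, not MioCodec's implementation: the actual loss additionally projects magnitudes through mel filterbanks and is combined with the GAN and RMS terms; the function names here are ours.

```python
import numpy as np

def stft_mag(x: np.ndarray, win: int) -> np.ndarray:
    """Magnitude STFT with a Hann window and 25% hop (illustrative)."""
    hop = max(win // 4, 1)
    window = np.hanning(win)
    frames = [x[i:i + win] * window for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def multi_resolution_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Average L1 distance over STFT magnitudes at every resolution."""
    wins = [32, 64, 128, 256, 512, 1024, 2048, 4096]
    return float(np.mean([np.mean(np.abs(stft_mag(pred, w) - stft_mag(target, w)))
                          for w in wins]))

# Identical signals give zero loss; a slightly noisy copy does not.
t = np.linspace(0, 1, 44100, endpoint=False)
clean = np.sin(2 * np.pi * 220.0 * t)
noisy = clean + 0.01 * np.random.default_rng(0).standard_normal(len(t))
print(multi_resolution_loss(clean, clean))      # 0.0
print(multi_resolution_loss(clean, noisy) > 0)  # True
```

Short windows resolve fast transients while long windows resolve fine pitch structure, which is why summing over several resolutions penalizes both jitter and spectral blur.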

## 📚 Training Data

The training datasets are listed below:

| Language | Approx. Hours | Dataset | Used in Phases |
| :--- | :--- | :--- | :---: |
| **Japanese** | ~15,000h | Various public HF datasets | 1, 2, 3 |
| **English** | ~500h | [Libriheavy-HQ](https://huggingface.co/datasets/mythicinfinity/Libriheavy-HQ) | 1, 2, 3 |
| **English** | ~4,000h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) | 1, 2 |
| **English** | ~9,000h | [HiFiTTS-2](https://huggingface.co/datasets/nvidia/hifitts-2) | 3 |
| **German** | ~1,950h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) | 1, 2 |
| **Dutch** | ~1,550h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) | 1, 2 |
| **French** | ~1,050h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) | 1, 2 |
| **Spanish** | ~900h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) | 1, 2 |
| **Italian** | ~240h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) | 1, 2 |
| **Portuguese** | ~160h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) | 1, 2 |
| **Polish** | ~100h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) | 1, 2 |

## 📜 Acknowledgements

* **Codec Architecture:** Based on the brilliant work of [kanade-tokenizer](https://github.com/frothywater/kanade-tokenizer).
* **Vocoder Base:** Weights and codebase derived from [AliasingFreeNeuralAudioSynthesis](https://github.com/sizigi/AliasingFreeNeuralAudioSynthesis).
* **Training Techniques:** Phase 3 training objectives were heavily inspired by [XCodec2](https://github.com/zhenye234/X-Codec-2.0) and [Inworld TTS-1](https://arxiv.org/html/2507.21138v1).

## 🖊️ Citation

```bibtex
@misc{miocodec-25hz,
  author       = {Chihiro Arata},
  title        = {MioCodec: High-Fidelity 44.1kHz Neural Audio Codec},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/MioCodec-25Hz}}
}
```
config.yaml (ADDED)
model:
  class_path: miocodec.model.MioCodecModel
  init_args:
    config:
      # SSL feature settings
      local_ssl_layers: [6, 9]
      global_ssl_layers: [1, 2]
      normalize_ssl_features: true

      # Down/up-sampling settings
      downsample_factor: 2
      mel_upsample_factor: 4
      use_conv_downsample: true
      mel_interpolation_mode: linear

      # Audio settings (match Pupu-Vocoder mel specs)
      sample_rate: 44100
      n_fft: 2048
      hop_length: 512
      n_mels: 128
      padding: same
      mel_backend: pupu
      mel_fmin: 0.0
      mel_fmax: 22050.0
      mel_win_length: 2048

    ssl_feature_extractor:
      class_path: miocodec.module.ssl_extractor.SSLFeatureExtractor
      init_args:
        model_name: wavlm_base_plus
        output_layer: 9  # Use at most 9 layers
        sample_rate: 44100  # Consistent with the target sample rate for reconstruction

    local_encoder:
      class_path: miocodec.module.transformer.Transformer
      init_args:
        dim: 768
        n_layers: 6
        n_heads: 12
        window_size: 125
        use_rope: true
        rope_theta: 10000.0
        max_seq_len: 512
        use_flash_attention: true

    local_quantizer:
      class_path: miocodec.module.fsq.FiniteScalarQuantizer
      init_args:
        input_dim: 768   # Must match local encoder output dimension
        output_dim: 768  # Must match feature decoder input dimension
        levels: [8, 8, 8, 5, 5]  # 8*8*8*5*5 = 12,800 codes

    feature_decoder: null

    global_encoder:
      class_path: miocodec.module.global_encoder.GlobalEncoder
      init_args:
        input_channels: 768  # WavLM-base+ feature dimension
        output_channels: 128
        num_layers: 4
        dim: 384
        intermediate_dim: 1152

    mel_prenet:
      class_path: miocodec.module.transformer.Transformer
      init_args:
        dim: 768
        output_dim: 512
        n_layers: 6
        n_heads: 12
        window_size: 65
        use_rope: true
        rope_theta: 10000.0
        max_seq_len: 512
        use_flash_attention: true

    mel_decoder:
      class_path: miocodec.module.transformer.Transformer
      init_args:
        dim: 512
        output_dim: 128  # Number of mel frequency bins
        n_layers: 6
        n_heads: 8
        window_size: 65
        use_rope: true
        rope_theta: 10000.0
        max_seq_len: 512
        adanorm_condition_dim: 128  # Must match global encoder output dimension
        use_adaln_zero: true  # Use AdaLN-Zero for conditioning
        use_flash_attention: true

    mel_postnet:
      class_path: miocodec.module.postnet.PostNet
      init_args:
        input_channels: 128  # Number of mel frequency bins
        channels: 256
        kernel_size: 7
        num_layers: 4
        use_layer_norm: true
model.safetensors (ADDED)
version https://git-lfs.github.com/spec/v1
oid sha256:e523449750ad0c9f66c08a5ece0ea60291bb8e615abcc0c46577f56a1c2beb62
size 532367664
vocoder_config.json (ADDED)
{
  "base_config": "egs/exp_config_pupuvocoder_base.json",
  "model_type": "PupuVocoder",
  "model": {
    "generator": "pupuvocoder",
    "pupuvocoder": {
      "resblock": "1",
      "upsample_rates": [8, 8, 2, 2, 2],
      "upsample_kernel_sizes": [16, 16, 4, 4, 4],
      "upsample_initial_channel": 512,
      "resblock_kernel_sizes": [3, 7, 11],
      "resblock_dilation_sizes": [
        [1, 3, 5],
        [1, 3, 5],
        [1, 3, 5]
      ]
    }
  },
  "train": {
    "criterions": [
      "feature",
      "discriminator",
      "generator",
      "multimel"
    ]
  },
  "inference": {
    "batch_size": 1
  }
}