KlonAudio

Open-source text-to-speech for European languages with voice cloning

About This Model

KlonAudio is a fork of kugelaudio/kugelaudio-0-open with restored voice cloning capabilities.

What's Different from the Original

The original KugelAudio model removed voice cloning functionality to reduce VRAM usage. This fork restores the full dual-encoder architecture (acoustic + semantic tokenizers) that enables:

  • ✨ Voice cloning from audio samples (5-10 seconds)
  • 🎭 Three pre-encoded German voices (radio, angry, old_lady)
  • 📦 Ready-to-use voice samples
  • ⚙️ Complete configuration files (preprocessor_config.json, tokenizer_config.json)

All credit for the base model training goes to the KugelAudio team (Kajo Kratzenstein, Carlos Menke). This fork simply re-enables features that existed in the original architecture.

Base model: kugelaudio/kugelaudio-0-open

Installation & Download

⚠️ Important: This model is 18GB. We highly recommend pre-downloading it with one of the methods below, which is faster and more reliable. Without pre-downloading, the first from_pretrained() call has to fetch every file itself and may be very slow or time out.

Prerequisites

# Install git-xet for faster cloning (shown here via Homebrew; see https://hf.co/docs/hub/git-xet for other install methods)
brew install git-xet
git xet install

# Install HuggingFace CLI
curl -LsSf https://hf.co/cli/install.sh | bash

Option 1: Clone the Full Repository

# Clone with all model files (18GB download)
git clone https://huggingface.co/Roland-JAAI/klonaudio

Option 2: Clone Without Large Files (Recommended)

# Clone without downloading large files initially - just their pointers
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Roland-JAAI/klonaudio

# Then download the model files into the HuggingFace cache using the HF CLI (faster and more reliable)
hf auth login --token <your-token>
hf download Roland-JAAI/klonaudio

Get your token: Create a free HuggingFace account and generate a token at https://huggingface.co/settings/tokens (read access is sufficient).

Option 3: Download Only Model Files

# Authenticate
hf auth login --token <your-token>

# Download just the model files to HuggingFace cache
hf download Roland-JAAI/klonaudio

After downloading, from_pretrained() loads the files from the local cache instead of downloading them again.
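
To confirm that the pre-download landed where from_pretrained() will look, the following is a minimal sketch that resolves the Hub cache directory and checks for a snapshot of this repository. The helper names (hub_cache_dir, is_cached, predownload) are hypothetical and not part of KlonAudio. The path logic mirrors the documented huggingface_hub cache layout (HF_HUB_CACHE, then HF_HOME, then ~/.cache/huggingface) and only tests that a snapshot folder exists, not that every file in it is complete.

```python
import os
from pathlib import Path

REPO_ID = "Roland-JAAI/klonaudio"


def hub_cache_dir() -> Path:
    """Resolve the Hub cache root from the environment, falling back to the default."""
    explicit = os.environ.get("HF_HUB_CACHE")
    if explicit:
        return Path(explicit).expanduser()
    xdg_cache = os.environ.get("XDG_CACHE_HOME", os.path.join("~", ".cache"))
    hf_home = os.environ.get("HF_HOME", os.path.join(xdg_cache, "huggingface"))
    return Path(hf_home).expanduser() / "hub"


def is_cached(repo_id: str = REPO_ID) -> bool:
    """Return True if at least one snapshot of the repo exists in the cache.

    This is a presence check only; an interrupted download can still pass it.
    """
    repo_dir = hub_cache_dir() / ("models--" + repo_id.replace("/", "--"))
    snapshots = repo_dir / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())


def predownload(repo_id: str = REPO_ID) -> str:
    """Python alternative to `hf download`: fetch the whole repo into the cache."""
    # Imported lazily so the cache check above works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id)
```

Calling is_cached() before loading the model gives an early hint whether the 18GB download still has to happen; predownload() does the same job as Option 3 from within Python.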

Quick Start

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Device selection (CUDA > MPS > CPU)
if torch.cuda.is_available():
    device = "cuda"
    dtype = torch.bfloat16
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = "mps"
    dtype = torch.float32  # MPS doesn't support bfloat16 well
else:
    device = "cpu"
    dtype = torch.float32

print(f"Using device: {device}")

# Load model and processor (uses cached files if you pre-downloaded)
model = AutoModelForCausalLM.from_pretrained(
    "Roland-JAAI/klonaudio",
    trust_remote_code=True,
    torch_dtype=dtype,
).to(device)
model.eval()

processor = AutoProcessor.from_pretrained(
    "Roland-JAAI/klonaudio",
    trust_remote_code=True
)

# See available pre-encoded voices
print(processor.get_available_voices())  # ["radio", "angry", "old_lady"]

# Generate speech with a named voice
inputs = processor(
    text="Guten Abend. Hier sind die Nachrichten.",
    voice="radio",
    return_tensors="pt"
)
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(**inputs, cfg_scale=3.0, max_new_tokens=2048)

# Save audio
processor.save_audio(outputs.speech_outputs[0], "output.wav")
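
The generate() call above caps output at max_new_tokens=2048. How much audio that corresponds to is not stated in this README, so for longer texts one option is to split the input into sentence-sized chunks and synthesize each chunk separately. This is an assumption about usage, not a documented requirement of the model. The helper below is a hypothetical sketch: split_into_chunks and its 300-character default are illustrative choices, not part of KlonAudio.

```python
import re


def split_into_chunks(text: str, max_chars: int = 300) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    # Split on whitespace that follows ., ! or ? so the punctuation stays attached.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    news = "Guten Abend. Hier sind die Nachrichten. Das Wetter bleibt mild."
    for chunk in split_into_chunks(news, max_chars=40):
        print(chunk)
```

Each chunk can then be passed as text= to processor(...) exactly as in the Quick Start; joining the resulting audio segments is left out of this sketch.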

Voice Cloning

Clone any voice from a reference audio file:

# Clone from audio file (requires encoders - don't call strip_encoders())
inputs = processor(
    text="Your text here",
    voice_prompt="path/to/reference_audio.wav",
    return_tensors="pt"
)
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(**inputs, cfg_scale=3.0, max_new_tokens=2048)

processor.save_audio(outputs.speech_outputs[0], "cloned_output.wav")
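
This README recommends reference samples of 5-10 seconds. A quick way to check a clip before passing it as voice_prompt is sketched below using only the standard library. The function names are hypothetical, the 5-10 second bounds are taken from the feature list above and treated here as a guideline rather than a hard limit of the model, and Python's wave module reads PCM WAV files only, so other formats would need converting first.

```python
import wave

# Range quoted in this README for voice cloning samples (a guideline, not a hard limit).
MIN_SECONDS = 5.0
MAX_SECONDS = 10.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_clip(path: str) -> float:
    """Return the clip duration, or raise ValueError if it is outside the range."""
    duration = wav_duration_seconds(path)
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        raise ValueError(
            f"{path} is {duration:.1f}s long; "
            f"expected {MIN_SECONDS:.0f}-{MAX_SECONDS:.0f}s of reference audio"
        )
    return duration
```

Running check_reference_clip("path/to/reference_audio.wav") ahead of the processor call surfaces an unsuitable clip length before any GPU time is spent.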

Pre-encoded Voices

This model includes three pre-encoded German voices:

Voice    | Description                  | Best For
radio    | Professional radio announcer | Default/professional content
angry    | Angry, frustrated speech     | Emotional dialogue
old_lady | Gentle elderly female        | Storytelling/warm content

Supported Languages

23 European languages with varying quality based on training data representation:

🇩🇪 German | 🇬🇧 English | 🇪🇸 Spanish | 🇫🇷 French | 🇮🇹 Italian | 🇵🇹 Portuguese | 🇳🇱 Dutch | 🇵🇱 Polish | 🇷🇺 Russian | 🇺🇦 Ukrainian | 🇨🇿 Czech | 🇷🇴 Romanian | 🇭🇺 Hungarian | 🇸🇪 Swedish | 🇩🇰 Danish | 🇫🇮 Finnish | 🇳🇴 Norwegian | 🇬🇷 Greek | 🇧🇬 Bulgarian | 🇸🇰 Slovak | 🇭🇷 Croatian | 🇷🇸 Serbian | 🇹🇷 Turkish

Note: Quality varies by language. German, Spanish, French, and English are the best-represented languages in the ~200,000 hours of training data (YODAS2 dataset) and have the best coverage.

Model Details

  • Base Model: kugelaudio/kugelaudio-0-open
  • Architecture: AR + Diffusion hybrid (based on Microsoft VibeVoice)
  • Parameters: 7B
  • Model Size: ~18GB
  • Training Data: ~200,000 hours from YODAS2 dataset
  • Training Hardware: 8x NVIDIA H100 GPUs
  • Training Duration: 5 days
  • License: MIT
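
The Quick Start loads the model in bfloat16 on CUDA but in float32 on MPS and CPU, which changes how much memory the weights occupy. The sketch below is rough arithmetic only: it assumes the ~18GB checkpoint stores 2 bytes per weight (an assumption, not stated in the list above) and it covers weights alone, not activations or other runtime buffers. weight_memory_gb is a hypothetical helper name.

```python
def weight_memory_gb(checkpoint_gb: float, stored_bytes: int = 2, loaded_bytes: int = 4) -> float:
    """Scale the checkpoint size by the ratio of loaded to stored bytes per weight."""
    return checkpoint_gb * loaded_bytes / stored_bytes


CHECKPOINT_GB = 18.0  # "Model Size: ~18GB" from the list above

# bfloat16 path (CUDA in the Quick Start): same width as the assumed storage format.
print(weight_memory_gb(CHECKPOINT_GB, loaded_bytes=2))  # 18.0

# float32 path (MPS/CPU in the Quick Start): twice the bytes per weight.
print(weight_memory_gb(CHECKPOINT_GB, loaded_bytes=4))  # 36.0
```

Under that assumption, the float32 fallback needs roughly twice the memory of the bfloat16 path for the weights alone.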

Differences from Base Model

This fork differs from kugelaudio/kugelaudio-0-open in the following ways:

  1. Voice Cloning Restored: Re-enabled acoustic and semantic encoders
  2. New Voice Files: Added three German pre-encoded voices (radio, angry, old_lady)
  3. Configuration Files: Added missing preprocessor_config.json and tokenizer_config.json
  4. Voice Samples: Included sample audio for each voice
  5. Documentation: Updated examples and documentation for voice cloning

The model weights themselves are identical to the base model. We only added the voice files and configurations that were missing.

Citation

@misc{klonaudio2026,
  title={KlonAudio: Open-Source TTS with Voice Cloning for European Languages},
  author={Roland Becker},
  year={2026},
  url={https://github.com/RolandJAAI/klonaudio}
}

@misc{kugelaudio2025,
  title={KugelAudio: Open-Source Text-to-Speech Model},
  author={Kajo Kratzenstein and Carlos Menke},
  year={2025},
  url={https://github.com/Kugelaudio/kugelaudio-open}
}

Acknowledgments

  • KugelAudio Team (Kajo Kratzenstein, Carlos Menke): For training the excellent base model and open-sourcing it under MIT license
  • Microsoft VibeVoice: For the original architecture with dual encoders
  • YODAS2 Dataset: For providing multilingual training data
  • HuggingFace: For model hosting and the transformers library
