
XTTS v2.0 – Egyptian Arabic LoRA Adapter

A LoRA adapter for Coqui XTTS v2.0 fine-tuned on Egyptian Arabic conversational speech. Trained as part of a synthetic speech dataset generation pipeline that processes real audio clips into speaker-conditioned TTS models for scalable data augmentation.

Model Details

Base model: Coqui XTTS v2.0
Method: LoRA (PEFT)
Language: Arabic (Egyptian dialect)
Task: Text-to-Speech
License: CPML (inherited from XTTS v2.0)

Training Configuration

LoRA rank (r): 16
LoRA alpha: 32
LoRA dropout: 0.05
Target modules: c_attn, c_proj, c_fc
Optimizer: AdamW (β₁=0.9, β₂=0.96, weight decay=0.01)
Learning rate: 1e-4
LR scheduler: MultiStepLR (milestones: 480, 625; γ=0.5)
Gradient clipping: 5.0
Batch size: 8 (gradient accumulation: 32; effective: 256)
Epochs: 15
Precision: FP16
Hardware: NVIDIA A100-80GB PCIe (Modal)
VRAM usage: ~2.1 GB allocated
Training time: 2.3 hours
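With r=16 and alpha=32, the LoRA update is scaled by alpha/r = 2.0 and starts as a no-op because B is zero-initialized. A minimal NumPy sketch of the low-rank update on a frozen weight (illustrative only; the actual adapter is built with PEFT, and the toy dimensions below are assumptions):

```python
import numpy as np

# Hyperparameters from the table above
r, alpha = 16, 32
scaling = alpha / r  # 2.0

rng = np.random.default_rng(0)
d_out, d_in = 64, 64                     # toy dimensions; real layers are larger
W = rng.normal(size=(d_out, d_in))       # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable, small random init
B = np.zeros((d_out, r))                 # trainable, zero init -> delta starts at 0

def lora_forward(x):
    # y = W x + (alpha/r) * B A x -- only A and B receive gradients
    return W @ x + scaling * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B at zero, the adapted layer matches the frozen base layer exactly
assert np.allclose(lora_forward(x), W @ x)
```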

Dataset

Clips: 12,363 (12,240 train / 123 eval)
Language: Egyptian Arabic
Source: Multi-speaker conversational speech, processed through a custom preprocessing pipeline (source separation, diarization, forced alignment, quality filtering)
Audio format: 22,050 Hz mono WAV
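Every clip is 22,050 Hz mono WAV. A quick stdlib check of that invariant (the helper name is hypothetical; the actual pipeline's filtering code is not published):

```python
import wave

TARGET_RATE = 22_050  # Hz, from the dataset spec above
TARGET_CHANNELS = 1   # mono

def is_valid_clip(path):
    """Return True if the WAV matches the dataset's 22,050 Hz mono format."""
    with wave.open(path, "rb") as wf:
        return (wf.getframerate() == TARGET_RATE
                and wf.getnchannels() == TARGET_CHANNELS)

# Demo: write a conforming one-second silent clip and validate it
with wave.open("demo_clip.wav", "wb") as wf:
    wf.setnchannels(TARGET_CHANNELS)
    wf.setsampwidth(2)                        # 16-bit PCM
    wf.setframerate(TARGET_RATE)
    wf.writeframes(b"\x00\x00" * TARGET_RATE)

assert is_valid_clip("demo_clip.wav")
```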

Training Results

Metric          Epoch 1    Epoch 15 (final)
Eval mel_ce     4.151      3.750
Eval text_ce    0.039      0.032
Train mel_ce    4.523      3.758
Train text_ce   0.041      0.034

The LoRA adapter preserves the base model's multilingual capabilities while improving dialectal Egyptian Arabic synthesis. Full training logs are available on Weights & Biases.
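For scale, the eval numbers in the table correspond to roughly a 9.7% relative drop in mel_ce and an 18% drop in text_ce between epoch 1 and epoch 15:

```python
# Eval metrics from the Training Results table
mel_ce = {"epoch_1": 4.151, "epoch_15": 3.750}
text_ce = {"epoch_1": 0.039, "epoch_15": 0.032}

def rel_drop(metric):
    """Relative reduction from epoch 1 to epoch 15, in percent."""
    return (metric["epoch_1"] - metric["epoch_15"]) / metric["epoch_1"] * 100

print(f"mel_ce:  {rel_drop(mel_ce):.1f}% lower")   # ~9.7%
print(f"text_ce: {rel_drop(text_ce):.1f}% lower")  # ~17.9%
```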

Files

best_model_merged.pth: LoRA weights merged into the base GPT checkpoint; load directly with Coqui TTS
lora_adapters/adapter_model.safetensors: standalone LoRA adapter (61 MB)
lora_adapters/adapter_config.json: PEFT adapter configuration
config.json: full Coqui Trainer configuration

Usage

With merged checkpoint (recommended)

Requires the base XTTS v2.0 files (vocab.json, dvae.pth, mel_stats.pth) from coqui/XTTS-v2.

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

config = XttsConfig()
config.load_json("config.json")

model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path="best_model_merged.pth",
    vocab_path="vocab.json",       # from coqui/XTTS-v2
    eval=True,
    use_deepspeed=False,
)
model.cuda()

outputs = model.synthesize(
    "النهارده حنتكلم عن موضوع مهم جداً",  # "Today we'll talk about a very important topic"
    config,
    speaker_wav="reference.wav",
    language="ar",
)
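model.synthesize returns a dict whose "wav" entry is a float waveform; XTTS v2 outputs audio at 24,000 Hz. A stdlib sketch for writing it to disk as 16-bit PCM, using a dummy tone in place of the real output so the snippet is self-contained:

```python
import math
import wave

SAMPLE_RATE = 24_000  # XTTS v2 output sample rate

def save_wav(samples, path, rate=SAMPLE_RATE):
    """Write float samples in [-1, 1] as 16-bit mono PCM."""
    frames = b"".join(
        int(max(-1.0, min(1.0, s)) * 32767).to_bytes(2, "little", signed=True)
        for s in samples
    )
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(frames)

# Stand-in for outputs["wav"]: one second of a 440 Hz tone instead of speech
fake_wav = [0.5 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
            for t in range(SAMPLE_RATE)]
save_wav(fake_wav, "output.wav")
```

In practice, pass outputs["wav"] to save_wav in place of the dummy array.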

Limitations

  • Inherits XTTS v2.0 architectural constraints: only the GPT-2 decoder is trainable; the HiFi-GAN vocoder and DVAE remain frozen.
  • Best results with 6โ€“12 seconds of clean reference audio from the target speaker.
  • Optimized for Egyptian Arabic; performance on other Arabic dialects is untested.
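The 6–12 second reference window can be enforced before inference. A hypothetical guard using the stdlib wave module (the helper name and demo clip are assumptions):

```python
import wave

MIN_SEC, MAX_SEC = 6.0, 12.0  # recommended reference length from above

def reference_ok(path):
    """Check that a reference WAV falls in the 6-12 s sweet spot."""
    with wave.open(path, "rb") as wf:
        duration = wf.getnframes() / wf.getframerate()
    return MIN_SEC <= duration <= MAX_SEC

# Demo: an 8-second silent mono clip passes the check
with wave.open("reference_demo.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(22_050)
    wf.writeframes(b"\x00\x00" * (22_050 * 8))

assert reference_ok("reference_demo.wav")
```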

Acknowledgments

Built with Coqui TTS, PEFT, and Modal for cloud GPU training. Experiment tracking via Weights & Biases.
