# XTTS v2.0: Egyptian Arabic LoRA Adapter
A LoRA adapter for Coqui XTTS v2.0 fine-tuned on Egyptian Arabic conversational speech. Trained as part of a synthetic speech dataset generation pipeline that processes real audio clips into speaker-conditioned TTS models for scalable data augmentation.
## Model Details
| Field | Value |
|---|---|
| Base model | Coqui XTTS v2.0 |
| Method | LoRA (PEFT) |
| Language | Arabic (Egyptian dialect) |
| Task | Text-to-Speech |
| License | CPML (inherited from XTTS v2.0) |
## Training Configuration
| Parameter | Value |
|---|---|
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Target modules | c_attn, c_proj, c_fc |
| Optimizer | AdamW (β₁=0.9, β₂=0.96, wd=0.01) |
| Learning rate | 1e-4 |
| LR scheduler | MultiStepLR (milestones: 480, 625; γ=0.5) |
| Gradient clipping | 5.0 |
| Batch size | 8 (grad accumulation: 32, effective: 256) |
| Epochs | 15 |
| Precision | FP16 |
| Hardware | NVIDIA A100-80GB PCIe (Modal) |
| VRAM usage | ~2.1 GB allocated |
| Training time | 2.3 hours |
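The rank, alpha, and dropout values above map directly onto PEFT's adapter configuration. A minimal sketch of the equivalent config dict, assuming PEFT's standard `adapter_config.json` field names (the shipped `lora_adapters/adapter_config.json` is the authoritative source):

```python
# LoRA hyperparameters from the table above, in PEFT's field naming.
lora_config = {
    "r": 16,                 # LoRA rank
    "lora_alpha": 32,        # scaling numerator
    "lora_dropout": 0.05,
    "target_modules": ["c_attn", "c_proj", "c_fc"],  # GPT-2-style projections
}

# Effective scaling applied to the low-rank update: alpha / r
scaling = lora_config["lora_alpha"] / lora_config["r"]
print(scaling)  # 2.0
```

With alpha at twice the rank, each LoRA update is scaled by 2.0 before being added to the frozen weights, a common default pairing.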
## Dataset
| Field | Value |
|---|---|
| Clips | 12,363 (12,240 train / 123 eval) |
| Language | Egyptian Arabic |
| Source | Multi-speaker conversational speech, processed through a custom preprocessing pipeline (source separation, diarization, forced alignment, quality filtering) |
| Audio format | 22,050 Hz mono WAV |
## Training Results
| Metric | Epoch 1 | Epoch 15 (final) |
|---|---|---|
| Eval mel_ce | 4.151 | 3.750 |
| Eval text_ce | 0.039 | 0.032 |
| Train mel_ce | 4.523 | 3.758 |
| Train text_ce | 0.041 | 0.034 |
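In relative terms, the table above corresponds to roughly a 10% reduction in eval mel_ce and an 18% reduction in eval text_ce over the 15 epochs; a quick check:

```python
# Relative improvement in eval losses from epoch 1 to epoch 15 (table above).
mel_ce_start, mel_ce_final = 4.151, 3.750
text_ce_start, text_ce_final = 0.039, 0.032

mel_improvement = (mel_ce_start - mel_ce_final) / mel_ce_start
text_improvement = (text_ce_start - text_ce_final) / text_ce_start
print(f"mel_ce: -{mel_improvement:.1%}, text_ce: -{text_improvement:.1%}")
```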
The LoRA adapter preserves the base model's multilingual capabilities while improving dialectal Egyptian Arabic synthesis. Full training logs are available on Weights & Biases.
## Files
| File | Description |
|---|---|
| `best_model_merged.pth` | LoRA weights merged into the base GPT checkpoint; load directly with Coqui TTS |
| `lora_adapters/adapter_model.safetensors` | Standalone LoRA adapter (61 MB) |
| `lora_adapters/adapter_config.json` | PEFT adapter configuration |
| `config.json` | Full Coqui Trainer configuration |
## Usage

### With merged checkpoint (recommended)
Requires the base XTTS v2.0 files (`vocab.json`, `dvae.pth`, `mel_stats.pth`) from coqui/XTTS-v2.
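If those base files are not already on disk, they can be fetched from the Hub. A hedged sketch using `huggingface_hub`'s `snapshot_download`; the import is deferred so the helper stays optional, and the destination directory name is illustrative:

```python
# Files this adapter reuses from the base coqui/XTTS-v2 repository.
BASE_FILES = ["vocab.json", "dvae.pth", "mel_stats.pth"]

def download_base_files(dest="xtts_base"):
    """Fetch only the required base files from the Hub.

    Assumes huggingface_hub is installed; the import is deferred so the
    rest of this sketch runs without it.
    """
    from huggingface_hub import snapshot_download
    return snapshot_download(
        repo_id="coqui/XTTS-v2",
        allow_patterns=BASE_FILES,
        local_dir=dest,
    )
```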
```python
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

config = XttsConfig()
config.load_json("config.json")

model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path="best_model_merged.pth",
    vocab_path="vocab.json",  # from coqui/XTTS-v2
    eval=True,
    use_deepspeed=False,
)
model.cuda()

outputs = model.synthesize(
    "النهارده حنتكلم عن موضوع مهم جدًا",  # "Today we'll talk about a very important topic"
    config,
    speaker_wav="reference.wav",
    language="ar",
)
```
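`model.synthesize` returns a dictionary whose `"wav"` entry holds the generated samples. A stdlib-only sketch of writing them to disk, assuming float samples in [-1, 1] and XTTS v2's 24 kHz output rate:

```python
import struct
import wave

def save_wav(samples, path, sample_rate=24000):
    """Write float samples in [-1, 1] as 16-bit mono PCM WAV.

    Assumes XTTS v2's 24 kHz output rate; stdlib-only, no soundfile needed.
    """
    pcm = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    with wave.open(path, "wb") as f:
        f.setnchannels(1)       # mono
        f.setsampwidth(2)       # 16-bit
        f.setframerate(sample_rate)
        f.writeframes(pcm)

# Usage: save_wav(outputs["wav"], "output.wav")
```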
## Limitations
- Inherits XTTS v2.0 architectural constraints: only the GPT-2 decoder is trainable; the HiFi-GAN vocoder and DVAE remain frozen.
- Best results with 6โ12 seconds of clean reference audio from the target speaker.
- Optimized for Egyptian Arabic; performance on other Arabic dialects is untested.
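The 6–12 second reference-audio guidance above is easy to check programmatically; a small stdlib sketch with a hypothetical helper (not part of this repo):

```python
import wave

def reference_duration_ok(path, lo=6.0, hi=12.0):
    """Return (duration_s, ok) for a WAV reference clip.

    Checks the 6-12 s sweet spot noted above; hypothetical helper,
    bounds are configurable.
    """
    with wave.open(path, "rb") as f:
        duration = f.getnframes() / f.getframerate()
    return duration, lo <= duration <= hi
```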
## Acknowledgments
Built with Coqui TTS, PEFT, and Modal for cloud GPU training. Experiment tracking via Weights & Biases.