
XTTS v2.0 – Egyptian Arabic LoRA Adapter

A LoRA adapter for Coqui XTTS v2.0 fine-tuned on Egyptian Arabic conversational speech. Trained as part of a synthetic speech dataset generation pipeline that processes real audio clips into speaker-conditioned TTS models for scalable data augmentation.

Model Details

Base model: Coqui XTTS v2.0
Method: LoRA (PEFT)
Language: Arabic (Egyptian dialect)
Task: Text-to-Speech
License: CPML (inherited from XTTS v2.0)

Training Configuration

LoRA rank (r): 16
LoRA alpha: 32
LoRA dropout: 0.05
Target modules: c_attn, c_proj, c_fc
Optimizer: AdamW (β₁=0.9, β₂=0.96, weight decay=0.01)
Learning rate: 1e-4
LR scheduler: MultiStepLR (milestones: 480, 625; γ=0.5)
Gradient clipping: 5.0
Batch size: 8 (gradient accumulation: 32; effective: 256)
Epochs: 15
Precision: FP16
Hardware: NVIDIA A100-80GB PCIe (Modal)
VRAM usage: ~2.1 GB allocated
Training time: 2.3 hours
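With r=16 and alpha=32, the LoRA update is scaled by alpha/r = 2.0 and starts as a no-op because B is zero-initialized. A minimal NumPy sketch of the low-rank update on a frozen weight (illustrative only; the actual adapter is built with PEFT, and the toy dimensions below are assumptions):

```python
import numpy as np

# Hyperparameters from the table above
r, alpha = 16, 32
scaling = alpha / r  # 2.0

rng = np.random.default_rng(0)
d_out, d_in = 64, 64                     # toy dimensions; real layers are larger
W = rng.normal(size=(d_out, d_in))       # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable, small random init
B = np.zeros((d_out, r))                 # trainable, zero init -> delta starts at 0

def lora_forward(x):
    # y = W x + (alpha/r) * B A x -- only A and B receive gradients
    return W @ x + scaling * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B at zero, the adapted layer matches the frozen base layer exactly
assert np.allclose(lora_forward(x), W @ x)
```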

Dataset

Clips: 12,363 (12,240 train / 123 eval)
Language: Egyptian Arabic
Source: Multi-speaker conversational speech, processed through a custom preprocessing pipeline (source separation, diarization, forced alignment, quality filtering)
Audio format: 22,050 Hz mono WAV
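Every clip is 22,050 Hz mono WAV. A quick stdlib check of that invariant (the helper name is hypothetical; the actual pipeline's filtering code is not published):

```python
import wave

TARGET_RATE = 22_050  # Hz, from the dataset spec above
TARGET_CHANNELS = 1   # mono

def is_valid_clip(path):
    """Return True if the WAV matches the dataset's 22,050 Hz mono format."""
    with wave.open(path, "rb") as wf:
        return (wf.getframerate() == TARGET_RATE
                and wf.getnchannels() == TARGET_CHANNELS)

# Demo: write a conforming one-second silent clip and validate it
with wave.open("demo_clip.wav", "wb") as wf:
    wf.setnchannels(TARGET_CHANNELS)
    wf.setsampwidth(2)                        # 16-bit PCM
    wf.setframerate(TARGET_RATE)
    wf.writeframes(b"\x00\x00" * TARGET_RATE)

assert is_valid_clip("demo_clip.wav")
```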

Training Results

Metric          Epoch 1    Epoch 15 (final)
Eval mel_ce     4.151      3.750
Eval text_ce    0.039      0.032
Train mel_ce    4.523      3.758
Train text_ce   0.041      0.034

The LoRA adapter preserves the base model's multilingual capabilities while improving dialectal Egyptian Arabic synthesis. Full training logs are available on Weights & Biases.
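For scale, the eval numbers in the table correspond to roughly a 9.7% relative drop in mel_ce and an 18% drop in text_ce between epoch 1 and epoch 15:

```python
# Eval metrics from the Training Results table
mel_ce = {"epoch_1": 4.151, "epoch_15": 3.750}
text_ce = {"epoch_1": 0.039, "epoch_15": 0.032}

def rel_drop(metric):
    """Relative reduction from epoch 1 to epoch 15, in percent."""
    return (metric["epoch_1"] - metric["epoch_15"]) / metric["epoch_1"] * 100

print(f"mel_ce:  {rel_drop(mel_ce):.1f}% lower")   # ~9.7%
print(f"text_ce: {rel_drop(text_ce):.1f}% lower")  # ~17.9%
```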

Files

best_model_merged.pth: LoRA weights merged into the base GPT checkpoint; load directly with Coqui TTS
lora_adapters/adapter_model.safetensors: standalone LoRA adapter (61 MB)
lora_adapters/adapter_config.json: PEFT adapter configuration
config.json: full Coqui Trainer configuration

Usage

With merged checkpoint (recommended)

Requires the base XTTS v2.0 files (vocab.json, dvae.pth, mel_stats.pth) from coqui/XTTS-v2.

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

config = XttsConfig()
config.load_json("config.json")

model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path="best_model_merged.pth",
    vocab_path="vocab.json",       # from coqui/XTTS-v2
    eval=True,
    use_deepspeed=False,
)
model.cuda()

outputs = model.synthesize(
    "النهارده حنتكلم عن موضوع مهم جداً",  # "Today we'll talk about a very important topic"
    config,
    speaker_wav="reference.wav",
    language="ar",
)
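model.synthesize returns a dict whose "wav" entry is a float waveform; XTTS v2 outputs audio at 24,000 Hz. A stdlib sketch for writing it to disk as 16-bit PCM, using a dummy tone in place of the real output so the snippet is self-contained:

```python
import math
import wave

SAMPLE_RATE = 24_000  # XTTS v2 output sample rate

def save_wav(samples, path, rate=SAMPLE_RATE):
    """Write float samples in [-1, 1] as 16-bit mono PCM."""
    frames = b"".join(
        int(max(-1.0, min(1.0, s)) * 32767).to_bytes(2, "little", signed=True)
        for s in samples
    )
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(frames)

# Stand-in for outputs["wav"]: one second of a 440 Hz tone instead of speech
fake_wav = [0.5 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
            for t in range(SAMPLE_RATE)]
save_wav(fake_wav, "output.wav")
```

In practice, pass outputs["wav"] to save_wav in place of the dummy array.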

Limitations

  • Inherits XTTS v2.0 architectural constraints: only the GPT-2 decoder is trainable; the HiFi-GAN vocoder and DVAE remain frozen.
  • Best results with 6โ€“12 seconds of clean reference audio from the target speaker.
  • Optimized for Egyptian Arabic; performance on other Arabic dialects is untested.
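The 6–12 second reference window can be enforced before inference. A hypothetical guard using the stdlib wave module (the helper name and demo clip are assumptions):

```python
import wave

MIN_SEC, MAX_SEC = 6.0, 12.0  # recommended reference length from above

def reference_ok(path):
    """Check that a reference WAV falls in the 6-12 s sweet spot."""
    with wave.open(path, "rb") as wf:
        duration = wf.getnframes() / wf.getframerate()
    return MIN_SEC <= duration <= MAX_SEC

# Demo: an 8-second silent mono clip passes the check
with wave.open("reference_demo.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(22_050)
    wf.writeframes(b"\x00\x00" * (22_050 * 8))

assert reference_ok("reference_demo.wav")
```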

Acknowledgments

Built with Coqui TTS, PEFT, and Modal for cloud GPU training. Experiment tracking via Weights & Biases.
