# AI Music DeepFake Detector

## Model Description
A hybrid deep learning system that combines an autoencoder with a transformer to detect AI-generated music, reaching 95% accuracy on its held-out test set. The model achieves 100% recall on authentic music (zero false negatives) while maintaining 90% recall on AI-generated tracks.
**Key Features:**
- Hybrid architecture (Autoencoder + Transformer)
- 21.1M parameters, 80.41 MB model size
- Trained on 400 balanced samples (GTZAN + Suno AI)
- Zero false negatives for authentic music
- Efficient mel-spectrogram based feature extraction
## Model Architecture

```
Audio → Mel-Spectrogram → Autoencoder (Encoder + Decoder)
                               ↓
                        Latent Features → Transformer (6 layers)
                               ↓
                        Fusion Layer → Classifier → [Real/AI]
```
**Components:**
- Autoencoder: Encoder (1→32→64→128→256 channels) + Decoder (256→128→64→32→1)
- Transformer: 6 layers, 8 attention heads, 768 hidden dimensions
- Classifier: 4-layer MLP (768→512→256→128→2)
- Loss Function: Combined (70% classification + 30% reconstruction)
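The repo ships the full implementation (`model_architecture.json` documents it); purely as an illustration of how these components can connect, here is a minimal PyTorch sketch built from the dimensions above. The kernel sizes, the latent-to-transformer projection, and the mean-pooling "fusion" step are assumptions, not the released code:

```python
import torch
import torch.nn as nn

class HybridDetector(nn.Module):
    """Sketch: autoencoder + transformer hybrid, dimensions from the model card."""
    def __init__(self):
        super().__init__()
        chans = [1, 32, 64, 128, 256]  # encoder channel progression
        self.encoder = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(i, o, 3, stride=2, padding=1), nn.ReLU())
            for i, o in zip(chans[:-1], chans[1:])
        ])
        # Decoder mirrors the encoder: 256 -> 128 -> 64 -> 32 -> 1
        self.decoder = nn.Sequential(*[
            nn.Sequential(nn.ConvTranspose2d(i, o, 4, stride=2, padding=1), nn.ReLU())
            for i, o in zip(chans[::-1][:-1], chans[::-1][1:])
        ])
        self.proj = nn.Linear(256, 768)  # latent channels -> transformer width
        layer = nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=6)
        # 4-layer MLP head: 768 -> 512 -> 256 -> 128 -> 2
        self.classifier = nn.Sequential(
            nn.Linear(768, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 2),
        )

    def forward(self, x):                      # x: (B, 1, n_mels, frames)
        z = self.encoder(x)                    # (B, 256, H, W) latent map
        recon = self.decoder(z)                # feeds the reconstruction loss
        tokens = z.flatten(2).transpose(1, 2)  # (B, H*W, 256) token sequence
        h = self.transformer(self.proj(tokens))
        logits = self.classifier(h.mean(dim=1))  # mean-pool tokens, then classify
        return logits, recon
```

With this shape, the 70/30 combined loss from the list above becomes `0.7 * cross_entropy(logits, y) + 0.3 * mse(recon, x)` (see the training sketch below).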
## Performance
| Metric | Real Music | AI Music |
|---|---|---|
| Precision | 90.91% | 100.00% |
| Recall | 100.00% | 90.00% |
| F1-Score | 95.24% | 94.74% |

**Overall Accuracy:** 95.00%

**Confusion Matrix:**
- Real Music: 30/30 correctly classified (0 false negatives)
- AI Music: 27/30 correctly classified (3 false positives)
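As a sanity check, the table's figures follow directly from these counts (treating Real as the positive class, as the card does):

```python
tp, fn = 30, 0   # Real tracks: all 30 correctly kept as Real
tn, fp = 27, 3   # AI tracks: 27 caught, 3 slipped through as Real

precision_real = tp / (tp + fp)  # 30 / 33 = 90.91%
precision_ai   = tn / (tn + fn)  # 27 / 27 = 100.00%
recall_real    = tp / (tp + fn)  # 30 / 30 = 100.00%
recall_ai      = tn / (tn + fp)  # 27 / 30 = 90.00%
accuracy       = (tp + tn) / (tp + fn + tn + fp)  # 57 / 60 = 95.00%
```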
## Training Details

- Hardware: NVIDIA GeForce MX450 (2.15 GB VRAM)
- Framework: PyTorch 2.7.1 + CUDA 11.8
- Epochs: 42 (early stopping, best at epoch 27)
- Optimizer: AdamW (lr=0.0001, weight_decay=1e-5)
- Scheduler: CosineAnnealingLR with 5-epoch warmup
- Batch Size: 32
- Dataset Split: 279 train / 61 validation / 60 test
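PyTorch's `CosineAnnealingLR` has no built-in warmup, so the 5-epoch warmup presumably comes from a chained scheduler. A hedged sketch of one common recipe; the `model`, `train_loader`, total epoch budget, and warmup start factor are assumptions:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

num_epochs = 50  # assumed budget; the card reports early stopping at epoch 42
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)

# 5-epoch linear warmup chained into cosine annealing
warmup = LinearLR(optimizer, start_factor=0.1, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=num_epochs - 5)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[5])

ce, mse = torch.nn.CrossEntropyLoss(), torch.nn.MSELoss()
for epoch in range(num_epochs):
    for mel, labels in train_loader:       # batches of 32 mel-spectrograms
        logits, recon = model(mel)         # hybrid model from the sketch above
        recon = recon[..., :mel.size(-1)]  # crop any transposed-conv overshoot
        loss = 0.7 * ce(logits, labels) + 0.3 * mse(recon, mel)  # 70/30 combined
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```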
## Usage

```python
import torch
import torchaudio
import torchaudio.transforms as T
from huggingface_hub import hf_hub_download

# Download the checkpoint
model_path = hf_hub_download(repo_id="huzaifanasirrr/ai-music-deepfake-detector",
                             filename="best_model.pth")

# Load the checkpoint (the model class itself lives in the repo;
# instantiate it from there, then restore the weights)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = torch.load(model_path, map_location=device)
# model = <architecture from the repo>
# model.load_state_dict(...)  # per the checkpoint layout defined in the repo
# model.to(device)

# Load and preprocess audio (mono, 22.05 kHz)
audio, sr = torchaudio.load("audio_file.wav")
if audio.shape[0] > 1:
    audio = audio.mean(dim=0, keepdim=True)  # downmix stereo to mono
if sr != 22050:
    resampler = T.Resample(sr, 22050)
    audio = resampler(audio)

# Extract the mel-spectrogram
mel_transform = T.MelSpectrogram(
    sample_rate=22050,
    n_fft=2048,
    hop_length=512,
    n_mels=128
)
mel_spec = mel_transform(audio)

# Normalize
mel_spec = (mel_spec - mel_spec.mean()) / (mel_spec.std() + 1e-8)

# Inference
model.eval()
with torch.no_grad():
    output = model(mel_spec.unsqueeze(0).to(device))
    prediction = torch.argmax(output, dim=1).item()

print("Real Music" if prediction == 0 else "AI Generated")
```
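Because training used fixed 10-second clips (see Limitations), longer tracks need to be segmented before inference. A minimal sketch, assuming the straightforward rule of averaging per-clip softmax scores (the aggregation rule is not something the card specifies):

```python
# Split a longer mono track into 10-second windows at 22.05 kHz
clip_len = 10 * 22050
clips = [audio[:, i:i + clip_len] for i in range(0, audio.shape[1], clip_len)]
clips = [c for c in clips if c.shape[1] == clip_len]  # drop the short tail

scores = []
with torch.no_grad():
    for clip in clips:
        mel = mel_transform(clip)
        mel = (mel - mel.mean()) / (mel.std() + 1e-8)
        logits = model(mel.unsqueeze(0).to(device))
        scores.append(torch.softmax(logits, dim=1)[0, 1].item())  # P(AI) per clip

avg = sum(scores) / len(scores)
print("AI Generated" if avg >= 0.5 else "Real Music", f"(mean P(AI) = {avg:.2f})")
```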
## Dataset

**Training Data:**
- GTZAN: 200 authentic music tracks (rock, pop, classical, jazz, etc.)
- Suno AI: 200 AI-generated tracks across multiple genres
- Total: 400 samples, 10 seconds each, 22.05 kHz sample rate
**Splits:**
- Training: 279 samples (70%)
- Validation: 61 samples (15%)
- Test: 60 samples (15%)
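The card doesn't publish the split procedure; for reference, a 70/15/15 stratified split can be reproduced along these lines (the scikit-learn call and the seed are assumptions, and exact counts will differ slightly from 279/61/60):

```python
from sklearn.model_selection import train_test_split

# files: list of audio paths; labels: 0 = Real (GTZAN), 1 = AI (Suno)
train_f, rest_f, train_y, rest_y = train_test_split(
    files, labels, test_size=0.30, stratify=labels, random_state=42)
val_f, test_f, val_y, test_y = train_test_split(
    rest_f, rest_y, test_size=0.50, stratify=rest_y, random_state=42)
# yields 280 / 60 / 60 on 400 samples; the card reports 279 / 61 / 60
```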
## Model Files

- `best_model.pth` - PyTorch checkpoint (80.41 MB)
- `model_architecture.json` - Complete model specifications
- `training_summary.json` - Training history (42 epochs)
- `training_curves.png` - Loss and accuracy visualization
- `confusion_matrix.png` - Test set results
- `config.yaml` - Full configuration
- `requirements.txt` - Dependencies

## Limitations
- Trained on 10-second audio clips (longer tracks need segmentation)
- Limited to 22.05 kHz sample rate
- Dataset size: 400 samples (may not generalize to all music styles)
- AI music limited to Suno AI generator (may not detect other generators)
- 3 false positives (AI → Real) in the test set
## Citation
If you use this model in your research, please cite:
```bibtex
@software{nasir2025aimusic,
  author    = {Nasir, Huzaifa},
  title     = {AI Music DeepFake Detector: A Hybrid Autoencoder-Transformer Approach},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/huzaifanasirrr/ai-music-deepfake-detector},
  note      = {GitHub: https://github.com/Huzaifanasir95/AI-Music-DeepFake-Detector}
}
```
## Author

Huzaifa Nasir
National University of Computer and Emerging Sciences (FAST-NUCES), Islamabad, Pakistan

📧 nasirhuzaifa95@gmail.com
🔗 [GitHub Repository](https://github.com/Huzaifanasir95/AI-Music-DeepFake-Detector)
## License
MIT License - See LICENSE file for details.
## Acknowledgments
Research conducted at FAST-NUCES Islamabad. Inspired by recent advances in audio deepfake detection and transformer-based architectures for audio processing.