AI Music DeepFake Detector

Python 3.8+ · PyTorch 2.7.1 · License: MIT · GitHub

Model Description

A hybrid deep learning system that combines an autoencoder with a transformer architecture to detect AI-generated music, reaching 95% accuracy on its held-out test set. The model achieves 100% recall on authentic music (no authentic track is flagged as AI) while maintaining 90% recall on AI-generated tracks.

Key Features:

  • Hybrid architecture (Autoencoder + Transformer)
  • 21.1M parameters, 80.41 MB model size
  • Trained on 400 balanced samples (GTZAN + Suno AI)
  • Zero false negatives for authentic music
  • Efficient mel-spectrogram based feature extraction

Model Architecture

Audio → Mel-Spectrogram → Autoencoder (Encoder + Decoder)
                              ↓
                          Latent Features → Transformer (6 layers)
                                                ↓
                                            Fusion Layer → Classifier → [Real/AI]

Components (sketched in PyTorch after this list):

  • Autoencoder: Encoder (1→32→64→128→256 channels) + Decoder (256→128→64→32→1)
  • Transformer: 6 layers, 8 attention heads, 768 hidden dimensions
  • Classifier: 4-layer MLP (768→512→256→128→2)
  • Loss Function: Combined (70% classification + 30% reconstruction)
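
The component list above maps to roughly the following PyTorch sketch. Channel and layer widths follow the specification, but kernel sizes, strides, the latent-to-token projection, and the mean-pooling used as a stand-in for the fusion layer are assumptions; the actual definitions are in model_architecture.json and the GitHub repo.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridDetectorSketch(nn.Module):
    """Illustrative sketch only: widths match the list above, everything else is assumed."""

    def __init__(self, d_model=768, num_classes=2):
        super().__init__()

        def down(c_in, c_out):   # stride-2 conv block (assumed)
            return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                                 nn.BatchNorm2d(c_out), nn.ReLU())

        def up(c_in, c_out):     # stride-2 transposed-conv block (assumed)
            return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
                                 nn.BatchNorm2d(c_out), nn.ReLU())

        # Autoencoder: encoder 1->32->64->128->256, decoder 256->128->64->32->1
        self.encoder = nn.Sequential(down(1, 32), down(32, 64), down(64, 128), down(128, 256))
        self.decoder = nn.Sequential(up(256, 128), up(128, 64), up(64, 32),
                                     nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1))

        # Transformer: 6 layers, 8 attention heads, 768 hidden dimensions
        self.to_tokens = nn.Linear(256, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=6)

        # Classifier: 768 -> 512 -> 256 -> 128 -> 2
        self.classifier = nn.Sequential(nn.Linear(d_model, 512), nn.ReLU(),
                                        nn.Linear(512, 256), nn.ReLU(),
                                        nn.Linear(256, 128), nn.ReLU(),
                                        nn.Linear(128, num_classes))

    def forward(self, mel):                       # mel: (batch, 1, n_mels, time)
        z = self.encoder(mel)                     # (batch, 256, H, W) latent features
        tokens = self.to_tokens(z.flatten(2).transpose(1, 2))   # (batch, H*W, 768)
        fused = self.transformer(tokens).mean(dim=1)             # mean-pool "fusion"
        return self.classifier(fused)             # class logits: [Real, AI]

    def reconstruct(self, mel):                   # decoder path, used only for the AE loss
        return self.decoder(self.encoder(mel))

# Combined loss: 70% classification + 30% reconstruction (weights from the list above)
def combined_loss(model, mel, labels):
    logits = model(mel)
    recon = model.reconstruct(mel)                # encoder runs twice here, for clarity only
    recon = F.interpolate(recon, size=mel.shape[-2:])  # guard against small size drift
    return 0.7 * F.cross_entropy(logits, labels) + 0.3 * F.mse_loss(recon, mel)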

Performance

Metric              Real Music    AI Music
Precision           90.91%        100.00%
Recall              100.00%       90.00%
F1-Score            95.24%        94.74%
Overall Accuracy    95.00%

Confusion Matrix:

  • Real Music: 30/30 correctly classified (no authentic track flagged as AI)
  • AI Music: 27/30 correctly classified (3 AI tracks misclassified as Real; see the check below)
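
The per-class figures in the table above follow directly from these counts. A quick re-derivation with scikit-learn (not a project dependency, just a sanity check; labels use 0 = Real, 1 = AI as in the Usage code):

import numpy as np
from sklearn.metrics import classification_report

# 30 Real tracks all predicted Real; 27 of 30 AI tracks predicted AI, 3 predicted Real
y_true = np.array([0] * 30 + [1] * 30)
y_pred = np.array([0] * 30 + [1] * 27 + [0] * 3)

print(classification_report(y_true, y_pred, target_names=["Real", "AI"], digits=4))
# Real: precision 0.9091, recall 1.0000; AI: precision 1.0000, recall 0.9000; accuracy 0.95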

Training Details

  • Hardware: NVIDIA GeForce MX450 (2.15GB VRAM)
  • Framework: PyTorch 2.7.1 + CUDA 11.8
  • Epochs: 42 (early stopping, best at epoch 27)
  • Optimizer: AdamW (lr=0.0001, weight_decay=1e-5)
  • Scheduler: CosineAnnealingLR with 5-epoch warmup (sketched after this list)
  • Batch Size: 32
  • Dataset Split: 279 train / 61 validation / 60 test
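
The optimizer and warmup-then-cosine schedule can be reproduced along these lines. This is a sketch: the warmup shape, the total epoch budget, and per-epoch stepping are assumptions, and the stand-in model below is not the detector; the exact values live in config.yaml.

import torch
import torch.nn as nn

model = nn.Linear(10, 2)   # stand-in; use the detector class from the repo
max_epochs = 50            # assumed budget; the run above stopped early at epoch 42
warmup_epochs = 5

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=warmup_epochs)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_epochs - warmup_epochs)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, schedulers=[warmup, cosine],
                                                  milestones=[warmup_epochs])

for epoch in range(max_epochs):
    # ... one training pass with the combined loss, then validation + early stopping ...
    scheduler.step()       # one scheduler step per epoch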

Usage

import torch
import torchaudio
import torchaudio.transforms as T
from huggingface_hub import hf_hub_download

# Download model
model_path = hf_hub_download(repo_id="huzaifanasirrr/ai-music-deepfake-detector", 
                              filename="best_model.pth")

# Load the checkpoint. The model class itself is defined in the GitHub repo:
# import it, instantiate it, then load the saved weights. The names below
# (HybridDetector, "model_state_dict") are placeholders; check the repo code
# and checkpoint.keys() for the actual ones.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = torch.load(model_path, map_location=device)
# model = HybridDetector()                               # placeholder class name
# model.load_state_dict(checkpoint["model_state_dict"])  # key name may differ
# model = model.to(device)

# Load and preprocess audio (the model expects mono, 10-second, 22.05 kHz clips)
audio, sr = torchaudio.load("audio_file.wav")
audio = audio.mean(dim=0, keepdim=True)  # mix down to mono: the encoder takes a single channel
if sr != 22050:
    resampler = T.Resample(sr, 22050)
    audio = resampler(audio)

# Extract mel-spectrogram
mel_transform = T.MelSpectrogram(
    sample_rate=22050,
    n_fft=2048,
    hop_length=512,
    n_mels=128
)
mel_spec = mel_transform(audio)

# Normalize
mel_spec = (mel_spec - mel_spec.mean()) / (mel_spec.std() + 1e-8)

# Inference
model.eval()
with torch.no_grad():
    output = model(mel_spec.unsqueeze(0).to(device))
    prediction = torch.argmax(output, dim=1).item()
    
print("Real Music" if prediction == 0 else "AI Generated")

Dataset

Training Data:

  • GTZAN: 200 authentic music tracks (rock, pop, classical, jazz, etc.)
  • Suno AI: 200 AI-generated tracks across multiple genres
  • Total: 400 samples, 10 seconds each, 22.05 kHz sample rate

Splits (reproduced in the sketch below):

  • Training: 279 samples (70%)
  • Validation: 61 samples (15%)
  • Test: 60 samples (15%)
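
For reference, a 279/61/60 split of the 400 clips can be reproduced with torch.utils.data.random_split. The dummy tensors and the seed below are placeholders, and the original split may have been stratified, so this only illustrates the counts:

import torch
from torch.utils.data import TensorDataset, random_split

# Dummy stand-in for the 400 mel-spectrogram clips (128 mel bins x ~431 frames) and labels
dataset = TensorDataset(torch.randn(400, 1, 128, 431), torch.randint(0, 2, (400,)))

train_set, val_set, test_set = random_split(
    dataset, [279, 61, 60], generator=torch.Generator().manual_seed(0))
print(len(train_set), len(val_set), len(test_set))   # 279 61 60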

Model Files

  • best_model.pth - PyTorch checkpoint (80.41 MB)
  • model_architecture.json - Complete model specifications
  • training_summary.json - Training history (42 epochs)
  • training_curves.png - Loss and accuracy visualization
  • confusion_matrix.png - Test set results
  • config.yaml - Full configuration (loaded in the snippet below)
  • requirements.txt - Dependencies
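
The configuration and architecture files can be read with the standard json module and PyYAML (assuming PyYAML is available; the pinned dependencies are in requirements.txt). The key names inside the files are not documented here, so the snippet only loads and inspects them:

import json
import yaml  # PyYAML

with open("config.yaml") as f:
    config = yaml.safe_load(f)            # training/model hyperparameters
with open("model_architecture.json") as f:
    architecture = json.load(f)           # layer-by-layer specification

print(config.keys(), architecture.keys())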

Limitations

  • Trained on 10-second audio clips (longer tracks need segmentation; see the sketch after this list)
  • Limited to 22.05 kHz sample rate
  • Dataset size: 400 samples (may not generalize to all music styles)
  • AI music limited to Suno AI generator (may not detect other generators)
  • 3 AI-generated tracks in the test set were misclassified as Real (AI → Real)
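
For tracks longer than 10 seconds, one simple workaround is to score non-overlapping 10-second windows and average the softmax probabilities. This is a hedged sketch: the windowing and averaging rule are choices for illustration, not part of the released code, and `model` is the loaded detector from the Usage section.

import torch
import torchaudio
import torchaudio.transforms as T

def predict_long_track(path, model, device, clip_seconds=10, sample_rate=22050):
    """Split a longer track into 10-second windows and average the per-window scores."""
    audio, sr = torchaudio.load(path)
    audio = audio.mean(dim=0, keepdim=True)            # mix down to mono
    if sr != sample_rate:
        audio = T.Resample(sr, sample_rate)(audio)

    mel = T.MelSpectrogram(sample_rate=sample_rate, n_fft=2048, hop_length=512, n_mels=128)
    clip_len = clip_seconds * sample_rate
    if audio.shape[1] < clip_len:
        raise ValueError("track is shorter than one 10-second clip")

    probs = []
    model.eval()
    with torch.no_grad():
        for start in range(0, audio.shape[1] - clip_len + 1, clip_len):
            spec = mel(audio[:, start:start + clip_len])
            spec = (spec - spec.mean()) / (spec.std() + 1e-8)   # same normalization as Usage
            logits = model(spec.unsqueeze(0).to(device))
            probs.append(torch.softmax(logits, dim=1))

    mean_probs = torch.cat(probs).mean(dim=0)
    return "Real Music" if mean_probs.argmax().item() == 0 else "AI Generated"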

Citation

If you use this model in your research, please cite:

@software{nasir2025aimusic,
  author = {Nasir, Huzaifa},
  title = {AI Music DeepFake Detector: A Hybrid Autoencoder-Transformer Approach},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/huzaifanasirrr/ai-music-deepfake-detector},
  note = {GitHub: https://github.com/Huzaifanasir95/AI-Music-DeepFake-Detector}
}

Author

Huzaifa Nasir
National University of Computer and Emerging Sciences (FAST-NUCES), Islamabad, Pakistan
📧 nasirhuzaifa95@gmail.com
🔗 GitHub: https://github.com/Huzaifanasir95/AI-Music-DeepFake-Detector

License

MIT License - See LICENSE file for details.

Acknowledgments

Research conducted at FAST-NUCES Islamabad. Inspired by recent advances in audio deepfake detection and transformer-based architectures for audio processing.
