---
license: apache-2.0
datasets:
- WpythonW/real-fake-voices-dataset2
- mozilla-foundation/common_voice_17_0
language:
- en
metrics:
- accuracy
- f1
- recall
- precision
base_model:
- MIT/ast-finetuned-audioset-10-10-0.4593
pipeline_tag: audio-classification
library_name: transformers
tags:
- audio
- audio-classification
- fake-audio-detection
- ast
widget:
- text: Upload an audio file to check if it's real or synthetic
inference:
  parameters:
    sampling_rate: 16000
    audio_channel: mono
model-index:
- name: ast-fakeaudio-detector
  results:
  - task:
      type: audio-classification
      name: Audio Classification
    dataset:
      name: real-fake-voices-dataset2
      type: WpythonW/real-fake-voices-dataset2
    metrics:
    - type: accuracy
      value: 0.9662
    - type: f1
      value: 0.971
    - type: precision
      value: 0.9692
    - type: recall
      value: 0.9728
---

# AST Fine-tuned for Fake Audio Detection

This model is a fine-tuned version of [MIT/ast-finetuned-audioset-10-10-0.4593](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593) for detecting fake (synthetic) audio. The original AST (Audio Spectrogram Transformer) classification head was replaced with a binary classification layer optimized for fake audio detection.

## Model Description

- **Base Model**: MIT/ast-finetuned-audioset-10-10-0.4593 (AST pretrained on AudioSet)
- **Task**: Binary classification (fake/real audio detection)
- **Input**: Audio converted to a Mel spectrogram (128 mel bins, 1024 time frames)
- **Output**: Two logits; applying softmax yields the probabilities `[fake_prob, real_prob]`
- **Training Hardware**: 2x NVIDIA T4 GPUs

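To relate the 1024-frame input to clip length, the sketch below estimates how many spectrogram frames a clip produces and how much of the input ends up as padding. It assumes Kaldi-style framing with a 25 ms window and a 10 ms hop on 16 kHz audio, which are the usual defaults for AST feature extraction but are not stated in this card. `estimate_frames` and `fit_to_model` are hypothetical helper names, not part of the model's API.

```python
SAMPLE_RATE = 16_000  # Hz, as stated in this model card
MAX_FRAMES = 1024     # time frames in the model input, as stated in this model card
WINDOW_MS = 25        # assumption: Kaldi-style analysis window length
HOP_MS = 10           # assumption: Kaldi-style frame shift


def estimate_frames(duration_s: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Estimate the number of spectrogram frames for a clip of the given length."""
    num_samples = int(round(duration_s * sample_rate))
    window = sample_rate * WINDOW_MS // 1000
    hop = sample_rate * HOP_MS // 1000
    if num_samples < window:
        return 0
    # Only windows that fit entirely inside the signal are counted
    return 1 + (num_samples - window) // hop


def fit_to_model(duration_s: float) -> tuple[int, int]:
    """Return (frames kept, frames of padding) for a 1024-frame input."""
    kept = min(estimate_frames(duration_s), MAX_FRAMES)
    return kept, MAX_FRAMES - kept


for seconds in (4.0, 10.0, 12.0):
    kept, padded = fit_to_model(seconds)
    print(f"{seconds:>5.1f} s -> {kept} frames kept, {padded} frames padded")
```

Under these assumptions a 10-second clip yields 998 frames, so it nearly fills the input, while a 12-second clip exceeds 1024 frames and would be cut off.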

# Usage Guide

## Model Usage
```python
import torch
import torchaudio
import soundfile as sf
import numpy as np
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

# Load model and move to available device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "WpythonW/ast-fakeaudio-detector"

extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModelForAudioClassification.from_pretrained(model_name).to(device)
model.eval()

# Process multiple audio files
audio_files = ["audio1.wav", "audio2.mp3", "audio3.ogg"]
processed_batch = []

for audio_path in audio_files:
    # Load audio file; multi-channel audio has shape (frames, channels)
    audio_data, sr = sf.read(audio_path)

    # Convert multi-channel audio to mono by averaging the channels
    if audio_data.ndim > 1:
        audio_data = np.mean(audio_data, axis=1)

    # Resample to 16 kHz if needed
    if sr != 16000:
        # Resample expects a (channels, time) float tensor
        waveform = torch.from_numpy(audio_data).float().unsqueeze(0)
        resample = torchaudio.transforms.Resample(
            orig_freq=sr,
            new_freq=16000
        )
        waveform = resample(waveform)
        audio_data = waveform.squeeze(0).numpy()

    processed_batch.append(audio_data)

# Prepare batch input
inputs = extractor(
    processed_batch,
    sampling_rate=16000,
    padding=True,
    return_tensors="pt"
)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Get predictions
with torch.no_grad():
    logits = model(**inputs).logits
    probabilities = torch.nn.functional.softmax(logits, dim=-1)

# Process results: index 0 is the fake class, index 1 is the real class
for filename, probs in zip(audio_files, probabilities):
    fake_prob = float(probs[0].cpu())
    real_prob = float(probs[1].cpu())
    prediction = "FAKE" if fake_prob > real_prob else "REAL"

    print(f"\nFile: {filename}")
    print(f"Fake probability: {fake_prob:.2%}")
    print(f"Real probability: {real_prob:.2%}")
    print(f"Verdict: {prediction}")
```

## Limitations

Important considerations when using this model:
1. The model expects 16 kHz mono audio; resample and downmix other inputs before inference.
2. Performance may degrade on types of audio manipulation that are not present in the training data.
3. The model was trained on audio samples 4 to 10 seconds long, so results on much shorter or longer clips may be less reliable.
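
One possible way to keep longer recordings within the 4 to 10 second training range is to cut them into consecutive windows before inference. The sketch below only computes the sample index ranges. `split_into_windows` is a hypothetical helper, and this windowing strategy is an assumption for illustration that the model authors have not validated.

```python
def split_into_windows(num_samples, sample_rate=16_000, window_s=10.0, min_s=4.0):
    """Return (start, end) sample indices of consecutive, non-overlapping windows."""
    window = int(window_s * sample_rate)
    min_len = int(min_s * sample_rate)
    spans = []
    start = 0
    while start < num_samples:
        end = min(start + window, num_samples)
        # Drop a trailing piece shorter than min_s, unless it is the only window
        if end - start >= min_len or not spans:
            spans.append((start, end))
        start = end
    return spans


# A 25-second recording at 16 kHz splits into two 10-second windows and one 5-second window
for start, end in split_into_windows(25 * 16_000):
    print(f"samples {start}..{end} ({(end - start) / 16_000:.1f} s)")
```

Each window could then be sliced from the waveform and passed through the feature extractor as a separate batch item. How to combine the per-window probabilities (for example, by averaging) is left open here.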