VeridisQuo: Dual-Stream Deepfake Detection
Model Description
VeridisQuo ("Where is the truth?" in Latin) is a specialized neural network for detecting deepfake manipulations in face images. Unlike traditional approaches that rely solely on spatial features, VeridisQuo employs a dual-stream architecture that combines:
- Spatial analysis via EfficientNet-B4 for texture and semantic patterns
- Frequency-domain analysis via DCT/FFT transforms to detect compression artifacts and frequency anomalies
This hybrid approach exploits a fundamental weakness in deepfake generation: while GANs and diffusion models can produce visually convincing spatial patterns, they often leave telltale signatures in the frequency domain—particularly in how they distribute high-frequency components and introduce subtle compression inconsistencies.
Why Frequency Analysis Matters
Most deepfake generators operate in the spatial domain and are trained to fool human perception (and spatial-only CNNs). However:
- DCT (Discrete Cosine Transform) analysis reveals block-wise compression patterns that differ between real cameras and synthetic generation
- FFT (Fast Fourier Transform) exposes irregularities in the radial frequency distribution, particularly in high-frequency bands where GANs struggle to maintain natural camera sensor characteristics
- Real faces from camera sensors exhibit specific frequency signatures related to demosaicing, lens characteristics, and sensor noise—signatures that deepfakes fail to replicate accurately
By combining both streams, VeridisQuo achieves robust detection even against adversarially trained generators.
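To make the frequency stream concrete, here is a minimal sketch of the kind of features described above: per-position energies of 8×8 block DCTs and energies in concentric FFT rings under a Hann window. The function names, grayscale input, band boundaries, and the use of NumPy/SciPy are illustrative assumptions, not the model's exact extractors.

```python
import numpy as np
from scipy.fft import dctn  # type-II DCT applied per 8x8 block

def dct_block_energies(gray: np.ndarray, block: int = 8) -> np.ndarray:
    """Mean absolute DCT coefficient per (u, v) position over all 8x8 blocks."""
    h, w = gray.shape
    h, w = h - h % block, w - w % block
    blocks = gray[:h, :w].reshape(h // block, block, w // block, block)
    blocks = blocks.transpose(0, 2, 1, 3).reshape(-1, block, block)
    coeffs = dctn(blocks, axes=(1, 2), norm="ortho")
    return np.abs(coeffs).mean(axis=0)  # (8, 8) map of block-wise frequency energy

def fft_radial_bands(gray: np.ndarray, n_bands: int = 8) -> np.ndarray:
    """Mean FFT magnitude in n_bands concentric frequency rings (Hann-windowed)."""
    win = np.outer(np.hanning(gray.shape[0]), np.hanning(gray.shape[1]))
    mag = np.abs(np.fft.fftshift(np.fft.fft2(gray * win)))
    cy, cx = np.array(gray.shape) // 2
    yy, xx = np.ogrid[:gray.shape[0], :gray.shape[1]]
    r = np.hypot(yy - cy, xx - cx)
    edges = np.linspace(0, r.max() + 1e-6, n_bands + 1)
    return np.array([mag[(r >= lo) & (r < hi)].mean()
                     for lo, hi in zip(edges[:-1], edges[1:])])
```

Synthetic images tend to show unusually flat or spiky high-frequency rings relative to camera output, which is the signal the FFT branch is built to expose.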
Architecture
```
Input: RGB Image [3, 224, 224]
│
├─► Spatial Branch
│     └─► EfficientNet-B4 (pretrained on ImageNet)
│           └─► Global Average Pooling
│                 └─► [1792-dim features]
│
└─► Frequency Branch
      ├─► DCT Extractor (8×8 blocks, frequency band aggregation)
      │     └─► [512-dim features]
      │
      └─► FFT Extractor (8 radial bands, Hann windowing)
            └─► [512-dim features]

      Concatenate DCT + FFT
      └─► Fusion MLP
            └─► [1024-dim features]

Concatenate Spatial + Frequency
└─► LayerNorm
      └─► [2816-dim combined features]
            └─► Classification Head (3-layer MLP)
                  └─► [2 classes: FAKE / REAL]
```
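A hedged PyTorch sketch of this layout is shown below. The feature sizes follow the diagram (1792 spatial + 1024 frequency = 2816), but the module names, the treatment of the DCT/FFT extractors as precomputed 512-dim vectors, and the hidden sizes inside the fusion MLP and classification head are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b4, EfficientNet_B4_Weights

class DualStreamDetector(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Spatial branch: ImageNet-pretrained EfficientNet-B4, pooled to 1792-dim features
        backbone = efficientnet_b4(weights=EfficientNet_B4_Weights.IMAGENET1K_V1)
        self.spatial = backbone.features
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Frequency branch: fuse 512-dim DCT + 512-dim FFT features into 1024 dims
        self.freq_fusion = nn.Sequential(
            nn.Linear(1024, 1024), nn.GELU(), nn.Dropout(0.2), nn.Linear(1024, 1024)
        )
        # Classifier over the 2816-dim concatenation (LayerNorm, then a 3-layer MLP)
        self.classifier = nn.Sequential(
            nn.LayerNorm(2816),
            nn.Linear(2816, 512), nn.GELU(), nn.Dropout(0.2),
            nn.Linear(512, 128), nn.GELU(), nn.Dropout(0.2),
            nn.Linear(128, num_classes),  # raw logits: FAKE=0, REAL=1
        )

    def forward(self, image, dct_feat, fft_feat):
        spatial = self.pool(self.spatial(image)).flatten(1)           # [B, 1792]
        freq = self.freq_fusion(torch.cat([dct_feat, fft_feat], 1))   # [B, 1024]
        return self.classifier(torch.cat([spatial, freq], 1))         # [B, 2] logits
```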
Key Technical Details
- Total Parameters: 40,798,098 (all trainable)
- Model Size: ~156 MB
- Input Normalization: ImageNet statistics (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
- Output: Logits for binary classification (FAKE=0, REAL=1)
- Framework: PyTorch 2.2+
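A minimal preprocessing sketch matching the input spec above (224×224 RGB, ImageNet normalization); the exact resize/crop policy used during training is an assumption.

```python
from torchvision import transforms

# Inference-time preprocessing: 224x224 RGB tensor normalized with ImageNet statistics
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```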
Design Choices
LayerNorm over BatchNorm: The classifier uses LayerNorm instead of BatchNorm to support single-image inference (batch_size=1) without requiring running statistics.
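An illustrative check of that choice: LayerNorm normalizes over the feature dimension of each sample and is indifferent to batch size, while BatchNorm1d in training mode cannot normalize a single sample.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 2816)             # single-image feature batch
print(nn.LayerNorm(2816)(x).shape)   # torch.Size([1, 2816]): normalized per sample
# nn.BatchNorm1d(2816).train()(x)    # would raise: needs >1 sample per channel when training
```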
Feature Dimension Balance: The 1792:1024 ratio between spatial and frequency features reflects their relative discriminative power—spatial features capture broader semantic context, while frequency features provide specialized forensic signals.
No Softmax in Forward Pass: The model outputs raw logits; apply torch.softmax(logits, dim=1) for probabilities.
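For instance, using the DualStreamDetector sketch from the Architecture section with dummy inputs (hypothetical usage, not the project's published API):

```python
import torch

model = DualStreamDetector().eval()              # sketch defined in the Architecture section
image = torch.randn(1, 3, 224, 224)              # stand-in for a preprocessed face tensor
dct_feat, fft_feat = torch.randn(1, 512), torch.randn(1, 512)

with torch.no_grad():
    logits = model(image, dct_feat, fft_feat)    # raw [1, 2] scores
    probs = torch.softmax(logits, dim=1)         # convert logits to probabilities
print({"FAKE": probs[0, 0].item(), "REAL": probs[0, 1].item()})
```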
Training Details
Dataset
Trained on FaceForensics++ (C23 compression), a benchmark dataset containing:
- 716,438 face images extracted from videos
- 4 deepfake techniques: Face2Face, FaceShifter, FaceSwap, NeuralTextures
- Split: 70% train (499,965) / 15% test (107,620) / 15% eval (108,853)
- Preprocessing: Faces detected via YOLOv8-based detector, cropped and aligned to 224×224
Dataset available on Kaggle: VeridisQuo Preprocessed Dataset
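A rough sketch of the face-cropping step described above, using the ultralytics YOLOv8 API with a generic COCO checkpoint as a placeholder; the actual face-detection weights and the alignment procedure used to build the dataset are not specified here.

```python
from PIL import Image
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")  # placeholder weights; the real pipeline uses a face-specific detector

def crop_face(path: str, size: int = 224):
    """Crop the highest-confidence detection and resize it to size x size."""
    result = detector(path)[0]
    if len(result.boxes) == 0:
        return None
    best = int(result.boxes.conf.argmax())             # pick the most confident box
    x1, y1, x2, y2 = result.boxes.xyxy[best].tolist()
    return Image.open(path).convert("RGB").crop((x1, y1, x2, y2)).resize((size, size))
```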
Hyperparameters
| Parameter | Value | Rationale |
|---|---|---|
| Optimizer | AdamW | Better weight decay regularization than Adam |
| Learning Rate | 1e-4 → 1e-6 | Cosine annealing with 3-epoch warmup |
| Batch Size | 64 | Optimal for 16GB GPU memory |
| Weight Decay | 1e-4 | L2 regularization to prevent overfitting |
| Epochs | 7 | Early stopping after validation loss plateau |
| Loss Function | CrossEntropyLoss | Standard for binary classification |
| Gradient Clipping | 1.0 | Prevents exploding gradients |
| Data Augmentation | HorizontalFlip (p=0.5), Rotation (±10°), ColorJitter | Improves generalization |
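A compact training-step sketch consistent with the table (AdamW at 1e-4 with weight decay 1e-4, cosine annealing toward 1e-6, gradient clipping at 1.0). The 3-epoch warmup and the augmentation pipeline are omitted, and the dummy tensors stand in for the real FaceForensics++ loader.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy tensors standing in for preprocessed batches: (image, dct, fft, label)
data = TensorDataset(torch.randn(8, 3, 224, 224), torch.randn(8, 512),
                     torch.randn(8, 512), torch.randint(0, 2, (8,)))
train_loader = DataLoader(data, batch_size=4)

model = DualStreamDetector()  # sketch from the Architecture section
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=7, eta_min=1e-6)
criterion = nn.CrossEntropyLoss()

for epoch in range(7):
    for images, dct_feats, fft_feats, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images, dct_feats, fft_feats), labels)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # cap gradient norm at 1.0
        optimizer.step()
    scheduler.step()  # anneal the learning rate once per epoch
```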
Training Infrastructure
- GPU: NVIDIA RTX 3090 (24GB VRAM)
- Duration: ~4 hours for 7 epochs
- Framework: PyTorch 2.2 with CUDA 12.1
- Precision: FP32 (mixed precision disabled for stability)
Regularization Strategy
- Dropout (0.2) in classifier layers
- Weight decay (1e-4) via AdamW
- Data augmentation during training
- Early stopping (patience=5 epochs)
- Gradient clipping (max_norm=1.0)
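The early-stopping rule above amounts to tracking the best validation loss and halting after five epochs without improvement; a minimal sketch with a stand-in validation loss:

```python
import random

best_loss, patience, wait = float("inf"), 5, 0
for epoch in range(50):
    val_loss = random.random()      # stand-in for the epoch's real validation loss
    if val_loss < best_loss:        # improvement: reset the counter
        best_loss, wait = val_loss, 0
    else:
        wait += 1
        if wait >= patience:        # no improvement for 5 consecutive epochs
            print(f"early stop at epoch {epoch}")
            break
```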
Maintainers: @Gazeux33
Repository: github.com/VeridisQuo-orga/VeridisQuo
Contact: For questions or collaborations, open an issue on GitHub.