VeridisQuo: Dual-Stream Deepfake Detection


Model Description

VeridisQuo ("Where is the truth?" in Latin) is a specialized neural network for detecting deepfake manipulations in face images. Unlike traditional approaches that rely solely on spatial features, VeridisQuo employs a dual-stream architecture that combines:

  1. Spatial analysis via EfficientNet-B4 for texture and semantic patterns
  2. Frequency-domain analysis via DCT/FFT transforms to detect compression artifacts and frequency anomalies

This hybrid approach exploits a fundamental weakness in deepfake generation: while GANs and diffusion models can produce visually convincing spatial patterns, they often leave telltale signatures in the frequency domain—particularly in how they distribute high-frequency components and introduce subtle compression inconsistencies.

Why Frequency Analysis Matters

Most deepfake generators operate in the spatial domain and are trained to fool human perception (and spatial-only CNNs). However:

  • DCT (Discrete Cosine Transform) analysis reveals block-wise compression patterns that differ between real cameras and synthetic generation
  • FFT (Fast Fourier Transform) exposes irregularities in the radial frequency distribution, particularly in high-frequency bands where GANs struggle to maintain natural camera sensor characteristics
  • Real faces from camera sensors exhibit specific frequency signatures related to demosaicing, lens characteristics, and sensor noise—signatures that deepfakes fail to replicate accurately

By combining both streams, VeridisQuo achieves robust detection even against adversarially trained generators.
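To make the frequency-branch idea concrete, the sketch below computes radial FFT statistics of the kind the model relies on. It is illustrative only: the 8 radial bands and the Hann window come from the architecture description, while the function itself (name, normalization, aggregation) is an assumption rather than the repository's actual extractor.

```python
import torch

def fft_radial_band_features(img_gray: torch.Tensor, n_bands: int = 8) -> torch.Tensor:
    """Aggregate the FFT magnitude spectrum into radial frequency bands (illustrative sketch).

    img_gray: [H, W] grayscale image tensor in [0, 1].
    Returns a [n_bands] vector with the mean log-magnitude of each radial band.
    """
    h, w = img_gray.shape
    # Hann window reduces spectral leakage from the image borders
    window = torch.outer(torch.hann_window(h), torch.hann_window(w))
    spectrum = torch.fft.fftshift(torch.fft.fft2(img_gray * window))
    log_mag = torch.log1p(spectrum.abs())

    # Radial distance of every frequency bin from the spectrum centre, normalized to [0, 1)
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    radius = torch.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    radius = radius / (radius.max() + 1e-6)

    # Mean log-magnitude inside each of the n_bands concentric rings
    feats = []
    for b in range(n_bands):
        mask = (radius >= b / n_bands) & (radius < (b + 1) / n_bands)
        feats.append(log_mag[mask].mean())
    return torch.stack(feats)

# Example: feats = fft_radial_band_features(torch.rand(224, 224))  ->  tensor of shape [8]
```

Per the points above, natural camera images tend to show a smooth fall-off across these bands, whereas generated faces often deviate in the upper (high-frequency) bands.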

Architecture

Input: RGB Image [3, 224, 224]
  │
  ├─► Spatial Branch
  │     └─► EfficientNet-B4 (pretrained on ImageNet)
  │           └─► Global Average Pooling
  │                 └─► [1792-dim features]
  │
  └─► Frequency Branch
        ├─► DCT Extractor (8×8 blocks, frequency band aggregation)
        │     └─► [512-dim features]
        │
        └─► FFT Extractor (8 radial bands, Hann windowing)
              └─► [512-dim features]

        Concatenate DCT + FFT
              └─► Fusion MLP
                    └─► [1024-dim features]

  Concatenate Spatial + Frequency
        └─► LayerNorm
              └─► [2816-dim combined features]
                    └─► Classification Head (3-layer MLP)
                          └─► [2 classes: FAKE / REAL]
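The diagram translates into a model skeleton along the following lines. The feature dimensions are taken from the diagram; the backbone call, the internal layer sizes of the fusion MLP and classification head, and the way frequency features enter the forward pass are assumptions, not the repository's exact code.

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b4

class DualStreamDetector(nn.Module):
    """Skeleton of the dual-stream architecture (dims from the diagram; layer sizes are assumed)."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Spatial branch: EfficientNet-B4 backbone -> 1792-dim pooled features
        backbone = efficientnet_b4(weights="IMAGENET1K_V1")
        self.spatial = nn.Sequential(backbone.features, nn.AdaptiveAvgPool2d(1), nn.Flatten())

        # Frequency branch: DCT and FFT extractors each yield 512-dim features
        # (stand-ins here; the real extractors use 8x8 DCT blocks and 8 radial FFT bands)
        self.dct_extractor = nn.LazyLinear(512)
        self.fft_extractor = nn.LazyLinear(512)
        self.freq_fusion = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())

        # Fused 1792 + 1024 = 2816-dim features -> 3-layer classification head
        self.norm = nn.LayerNorm(2816)
        self.classifier = nn.Sequential(
            nn.Linear(2816, 512), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(512, 128), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(128, num_classes),  # raw logits: FAKE=0, REAL=1
        )

    def forward(self, x, dct_feats, fft_feats):
        spatial = self.spatial(x)                                                       # [B, 1792]
        freq = self.freq_fusion(torch.cat(
            [self.dct_extractor(dct_feats), self.fft_extractor(fft_feats)], dim=1))     # [B, 1024]
        return self.classifier(self.norm(torch.cat([spatial, freq], dim=1)))            # [B, 2]
```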

Key Technical Details

  • Total Parameters: 40,798,098 (all trainable)
  • Model Size: ~156 MB
  • Input Normalization: ImageNet statistics (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
  • Output: Logits for binary classification (FAKE=0, REAL=1)
  • Framework: PyTorch 2.2+
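A preprocessing pipeline consistent with the normalization above might look like this (illustrative; the repository may ship its own transform):

```python
from torchvision import transforms

# Inference-time preprocessing: resize to the expected input size, then apply ImageNet statistics
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```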

Design Choices

LayerNorm over BatchNorm: The classifier uses LayerNorm instead of BatchNorm to support single-image inference (batch_size=1) without requiring running statistics.

Feature Dimension Balance: The 1792:1024 ratio between spatial and frequency features reflects their relative discriminative power—spatial features capture broader semantic context, while frequency features provide specialized forensic signals.

No Softmax in Forward Pass: The model outputs raw logits; apply torch.softmax(logits, dim=1) for probabilities.
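Putting the last point into practice, a typical single-image inference call could look like the following sketch. It assumes `model` is the loaded detector taking a single preprocessed image tensor and `image` is a [3, 224, 224] tensor produced by a transform such as the one above.

```python
import torch

model.eval()
with torch.no_grad():
    logits = model(image.unsqueeze(0))       # [1, 2] raw logits
    probs = torch.softmax(logits, dim=1)     # [1, 2] -> P(FAKE), P(REAL)
    label = "REAL" if probs.argmax(dim=1).item() == 1 else "FAKE"

print(f"P(fake)={probs[0, 0]:.3f}  P(real)={probs[0, 1]:.3f}  ->  {label}")
```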

Training Details

Dataset

Trained on FaceForensics++ (C23 compression), a benchmark dataset containing:

  • 716,438 face images extracted from videos
  • 4 deepfake techniques: Face2Face, FaceShifter, FaceSwap, NeuralTextures
  • Split: 70% train (499,965) / 15% test (107,620) / 15% eval (108,853)
  • Preprocessing: Faces detected with a YOLOv8-based detector, then cropped and aligned to 224×224

Dataset available on Kaggle: VeridisQuo Preprocessed Dataset

Hyperparameters

| Parameter | Value | Rationale |
|---|---|---|
| Optimizer | AdamW | Better weight decay regularization than Adam |
| Learning Rate | 1e-4 → 1e-6 | Cosine annealing with 3-epoch warmup |
| Batch Size | 64 | Optimal for 16 GB of GPU memory |
| Weight Decay | 1e-4 | L2 regularization to prevent overfitting |
| Epochs | 7 | Early stopping after validation loss plateau |
| Loss Function | CrossEntropyLoss | Standard for binary classification |
| Gradient Clipping | 1.0 | Prevents exploding gradients |
| Data Augmentation | HorizontalFlip (p=0.5), Rotation (±10°), ColorJitter | Improves generalization |
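The optimizer, scheduler, and clipping settings from the table translate roughly into the setup below. The exact scheduler composition, step accounting, and loop structure are assumptions; `model` and `train_loader` are placeholders rather than objects from the repository.

```python
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

# AdamW with weight decay 1e-4; 3-epoch linear warmup, then cosine decay from 1e-4 to 1e-6
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
warmup = LinearLR(optimizer, start_factor=0.1, total_iters=3)
cosine = CosineAnnealingLR(optimizer, T_max=4, eta_min=1e-6)
scheduler = SequentialLR(optimizer, [warmup, cosine], milestones=[3])
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(7):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        # Gradient clipping at max_norm=1.0 to prevent exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
    scheduler.step()  # schedulers stepped once per epoch
```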

Training Infrastructure

  • GPU: NVIDIA RTX 3090 (24GB VRAM)
  • Duration: ~4 hours for 7 epochs
  • Framework: PyTorch 2.2 with CUDA 12.1
  • Precision: FP32 (mixed precision disabled for stability)

Regularization Strategy

  1. Dropout (0.2) in classifier layers
  2. Weight decay (1e-4) via AdamW
  3. Data augmentation during training
  4. Early stopping (patience=5 epochs)
  5. Gradient clipping (max_norm=1.0)
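Item 4 can be realized with a standard patience counter, sketched below (the helper functions, variable names, and checkpoint path are hypothetical):

```python
import torch

max_epochs = 50  # hypothetical upper bound; early stopping usually ends training sooner
best_val_loss, patience, epochs_without_improvement = float("inf"), 5, 0

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader)      # assumed helper
    val_loss = evaluate(model, val_loader)    # assumed helper
    if val_loss < best_val_loss:
        best_val_loss, epochs_without_improvement = val_loss, 0
        torch.save(model.state_dict(), "best_checkpoint.pt")  # hypothetical path
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # stop once validation loss has plateaued for `patience` epochs
```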

Maintainers: @Gazeux33
Repository: github.com/VeridisQuo-orga/VeridisQuo
Contact: For questions or collaborations, open an issue on GitHub
