VeridisQuo: Dual-Stream Deepfake Detection


Model Description

VeridisQuo ("Where is the truth?" in Latin) is a specialized neural network for detecting deepfake manipulations in face images. Unlike traditional approaches that rely solely on spatial features, VeridisQuo employs a dual-stream architecture that combines:

  1. Spatial analysis via EfficientNet-B4 for texture and semantic patterns
  2. Frequency-domain analysis via DCT/FFT transforms to detect compression artifacts and frequency anomalies

This hybrid approach exploits a fundamental weakness in deepfake generation: while GANs and diffusion models can produce visually convincing spatial patterns, they often leave telltale signatures in the frequency domain—particularly in how they distribute high-frequency components and introduce subtle compression inconsistencies.

Why Frequency Analysis Matters

Most deepfake generators operate in the spatial domain and are trained to fool human perception (and spatial-only CNNs). However:

  • DCT (Discrete Cosine Transform) analysis reveals block-wise compression patterns that differ between real cameras and synthetic generation
  • FFT (Fast Fourier Transform) exposes irregularities in the radial frequency distribution, particularly in high-frequency bands where GANs struggle to maintain natural camera sensor characteristics
  • Real faces from camera sensors exhibit specific frequency signatures related to demosaicing, lens characteristics, and sensor noise—signatures that deepfakes fail to replicate accurately

By combining both streams, VeridisQuo achieves robust detection even against adversarially trained generators.
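To make the frequency-branch idea concrete, the sketch below computes radial FFT statistics of the kind the model relies on. It is illustrative only: the 8 radial bands and the Hann window come from the architecture description, while the function itself (name, normalization, aggregation) is an assumption rather than the repository's actual extractor.

```python
import torch

def fft_radial_band_features(img_gray: torch.Tensor, n_bands: int = 8) -> torch.Tensor:
    """Aggregate the FFT magnitude spectrum into radial frequency bands (illustrative sketch).

    img_gray: [H, W] grayscale image tensor in [0, 1].
    Returns a [n_bands] vector with the mean log-magnitude of each radial band.
    """
    h, w = img_gray.shape
    # Hann window reduces spectral leakage from the image borders
    window = torch.outer(torch.hann_window(h), torch.hann_window(w))
    spectrum = torch.fft.fftshift(torch.fft.fft2(img_gray * window))
    log_mag = torch.log1p(spectrum.abs())

    # Radial distance of every frequency bin from the spectrum centre, normalized to [0, 1)
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    radius = torch.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    radius = radius / (radius.max() + 1e-6)

    # Mean log-magnitude inside each of the n_bands concentric rings
    feats = []
    for b in range(n_bands):
        mask = (radius >= b / n_bands) & (radius < (b + 1) / n_bands)
        feats.append(log_mag[mask].mean())
    return torch.stack(feats)

# Example: feats = fft_radial_band_features(torch.rand(224, 224))  ->  tensor of shape [8]
```

Per the points above, natural camera images tend to show a smooth fall-off across these bands, whereas generated faces often deviate in the upper (high-frequency) bands.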

Architecture

Input: RGB Image [3, 224, 224]
  │
  ├─► Spatial Branch
  │     └─► EfficientNet-B4 (pretrained on ImageNet)
  │           └─► Global Average Pooling
  │                 └─► [1792-dim features]
  │
  └─► Frequency Branch
        ├─► DCT Extractor (8×8 blocks, frequency band aggregation)
        │     └─► [512-dim features]
        │
        └─► FFT Extractor (8 radial bands, Hann windowing)
              └─► [512-dim features]

        Concatenate DCT + FFT
              └─► Fusion MLP
                    └─► [1024-dim features]

  Concatenate Spatial + Frequency
        └─► LayerNorm
              └─► [2816-dim combined features]
                    └─► Classification Head (3-layer MLP)
                          └─► [2 classes: FAKE / REAL]
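The diagram translates into a model skeleton along the following lines. The feature dimensions are taken from the diagram; the backbone call, the internal layer sizes of the fusion MLP and classification head, and the way frequency features enter the forward pass are assumptions, not the repository's exact code.

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b4

class DualStreamDetector(nn.Module):
    """Skeleton of the dual-stream architecture (dims from the diagram; layer sizes are assumed)."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Spatial branch: EfficientNet-B4 backbone -> 1792-dim pooled features
        backbone = efficientnet_b4(weights="IMAGENET1K_V1")
        self.spatial = nn.Sequential(backbone.features, nn.AdaptiveAvgPool2d(1), nn.Flatten())

        # Frequency branch: DCT and FFT extractors each yield 512-dim features
        # (stand-ins here; the real extractors use 8x8 DCT blocks and 8 radial FFT bands)
        self.dct_extractor = nn.LazyLinear(512)
        self.fft_extractor = nn.LazyLinear(512)
        self.freq_fusion = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())

        # Fused 1792 + 1024 = 2816-dim features -> 3-layer classification head
        self.norm = nn.LayerNorm(2816)
        self.classifier = nn.Sequential(
            nn.Linear(2816, 512), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(512, 128), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(128, num_classes),  # raw logits: FAKE=0, REAL=1
        )

    def forward(self, x, dct_feats, fft_feats):
        spatial = self.spatial(x)                                                       # [B, 1792]
        freq = self.freq_fusion(torch.cat(
            [self.dct_extractor(dct_feats), self.fft_extractor(fft_feats)], dim=1))     # [B, 1024]
        return self.classifier(self.norm(torch.cat([spatial, freq], dim=1)))            # [B, 2]
```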

Key Technical Details

  • Total Parameters: 40,798,098 (all trainable)
  • Model Size: ~156 MB
  • Input Normalization: ImageNet statistics (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
  • Output: Logits for binary classification (FAKE=0, REAL=1)
  • Framework: PyTorch 2.2+
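A preprocessing pipeline consistent with the normalization above might look like this (illustrative; the repository may ship its own transform):

```python
from torchvision import transforms

# Inference-time preprocessing: resize to the expected input size, then apply ImageNet statistics
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```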

Design Choices

LayerNorm over BatchNorm: The classifier uses LayerNorm instead of BatchNorm to support single-image inference (batch_size=1) without requiring running statistics.

Feature Dimension Balance: The 1792:1024 ratio between spatial and frequency features reflects their relative discriminative power—spatial features capture broader semantic context, while frequency features provide specialized forensic signals.

No Softmax in Forward Pass: The model outputs raw logits; apply torch.softmax(logits, dim=1) for probabilities.
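Putting the last point into practice, a typical single-image inference call could look like the following sketch. It assumes `model` is the loaded detector taking a single preprocessed image tensor and `image` is a [3, 224, 224] tensor produced by a transform such as the one above.

```python
import torch

model.eval()
with torch.no_grad():
    logits = model(image.unsqueeze(0))       # [1, 2] raw logits
    probs = torch.softmax(logits, dim=1)     # [1, 2] -> P(FAKE), P(REAL)
    label = "REAL" if probs.argmax(dim=1).item() == 1 else "FAKE"

print(f"P(fake)={probs[0, 0]:.3f}  P(real)={probs[0, 1]:.3f}  ->  {label}")
```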

Training Details

Dataset

Trained on FaceForensics++ (C23 compression), a benchmark dataset containing:

  • 716,438 face images extracted from videos
  • 4 deepfake techniques: Face2Face, FaceShifter, FaceSwap, NeuralTextures
  • Split: 70% train (499,965) / 15% test (107,620) / 15% eval (108,853)
  • Preprocessing: Faces detected with a YOLOv8-based detector, then cropped and aligned to 224×224

Dataset available on Kaggle: VeridisQuo Preprocessed Dataset

Hyperparameters

| Parameter | Value | Rationale |
|---|---|---|
| Optimizer | AdamW | Better weight decay regularization than Adam |
| Learning Rate | 1e-4 → 1e-6 | Cosine annealing with 3-epoch warmup |
| Batch Size | 64 | Optimal for 16 GB of GPU memory |
| Weight Decay | 1e-4 | L2 regularization to prevent overfitting |
| Epochs | 7 | Early stopping after validation loss plateau |
| Loss Function | CrossEntropyLoss | Standard for binary classification |
| Gradient Clipping | 1.0 | Prevents exploding gradients |
| Data Augmentation | HorizontalFlip (p=0.5), Rotation (±10°), ColorJitter | Improves generalization |
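The optimizer, scheduler, and clipping settings from the table translate roughly into the setup below. The exact scheduler composition, step accounting, and loop structure are assumptions; `model` and `train_loader` are placeholders rather than objects from the repository.

```python
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

# AdamW with weight decay 1e-4; 3-epoch linear warmup, then cosine decay from 1e-4 to 1e-6
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
warmup = LinearLR(optimizer, start_factor=0.1, total_iters=3)
cosine = CosineAnnealingLR(optimizer, T_max=4, eta_min=1e-6)
scheduler = SequentialLR(optimizer, [warmup, cosine], milestones=[3])
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(7):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        # Gradient clipping at max_norm=1.0 to prevent exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
    scheduler.step()  # schedulers stepped once per epoch
```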

Training Infrastructure

  • GPU: NVIDIA RTX 3090 (24GB VRAM)
  • Duration: ~4 hours for 7 epochs
  • Framework: PyTorch 2.2 with CUDA 12.1
  • Precision: FP32 (mixed precision disabled for stability)

Regularization Strategy

  1. Dropout (0.2) in classifier layers
  2. Weight decay (1e-4) via AdamW
  3. Data augmentation during training
  4. Early stopping (patience=5 epochs)
  5. Gradient clipping (max_norm=1.0)
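Item 4 can be realized with a standard patience counter, sketched below (the helper functions, variable names, and checkpoint path are hypothetical):

```python
import torch

max_epochs = 50  # hypothetical upper bound; early stopping usually ends training sooner
best_val_loss, patience, epochs_without_improvement = float("inf"), 5, 0

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader)      # assumed helper
    val_loss = evaluate(model, val_loader)    # assumed helper
    if val_loss < best_val_loss:
        best_val_loss, epochs_without_improvement = val_loss, 0
        torch.save(model.state_dict(), "best_checkpoint.pt")  # hypothetical path
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # stop once validation loss has plateaued for `patience` epochs
```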

Maintainers: @Gazeux33
Repository: github.com/VeridisQuo-orga/VeridisQuo
Contact: For questions or collaborations, open an issue on GitHub
