arxiv:2512.15693

Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning

Published on Dec 17 · Submitted by Yifei Li on Dec 18

Abstract

Skyra, a specialized multimodal large language model, detects and explains visual artifacts in AI-generated videos using a novel dataset and two-stage training strategy, outperforming existing methods.

AI-generated summary

The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors. However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video artifact dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy. To comprehensively evaluate Skyra, we introduce ViF-Bench, a benchmark comprising 3K high-quality samples generated by over ten state-of-the-art video generators. Extensive experiments demonstrate that Skyra surpasses existing methods across multiple benchmarks, while our evaluation yields valuable insights for advancing explainable AI-generated video detection.

Community

Paper author · Paper submitter

Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning
https://huggingface.co/papers/2512.15693

Explainable AI-generated video detection with a specialized multimodal LLM. Given an input video, Skyra explicitly identifies human-perceivable spatio-temporal artifacts (e.g., texture/structure inconsistencies, motion irregularities) and uses them as grounded evidence to produce both a real/fake decision and a human-interpretable explanation with localized cues. To train this capability, we introduce ViF-CoT-4K, the first large-scale AI-generated video artifact dataset with fine-grained human annotations, enabling supervised fine-tuning (Skyra-SFT). We further apply a second-stage reinforcement learning procedure to encourage the model to actively mine discriminative artifacts, improving both detection and explanation quality (Skyra-RL). For rigorous evaluation, we release ViF-Bench (3K high-quality samples from 10+ state-of-the-art video generators) with aligned real/fake semantics and formats to reduce shortcut signals, and demonstrate consistent gains over prior binary detectors and MLLM-based baselines.
Learn more at https://joeleelyf.github.io/Skyra and https://github.com/JoeLeelyf/Skyra.
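The pipeline sketched above (artifact evidence as grounding for both a real/fake decision and an explanation) suggests a structured detector response that downstream code can consume. The parser below is a minimal sketch under that assumption; the JSON schema, field names, and `parse_response` helper are hypothetical illustrations, not Skyra's actual output format:

```python
import json
from dataclasses import dataclass, field


@dataclass
class DetectionResult:
    """Verdict plus the artifact cues cited as grounded evidence."""
    is_ai_generated: bool
    artifacts: list = field(default_factory=list)
    explanation: str = ""


def parse_response(raw: str) -> DetectionResult:
    """Parse a hypothetical structured detector response.

    Assumed schema:
      {"verdict": "fake" | "real",
       "artifacts": [{"type": ..., "timestamp": ..., "region": ...}],
       "explanation": "..."}
    """
    data = json.loads(raw)
    return DetectionResult(
        is_ai_generated=(data["verdict"].lower() == "fake"),
        artifacts=data.get("artifacts", []),
        explanation=data.get("explanation", ""),
    )


# Example response citing spatio-temporal artifact cues.
raw = json.dumps({
    "verdict": "fake",
    "artifacts": [
        {"type": "texture_inconsistency", "timestamp": 2.4, "region": "hands"},
        {"type": "motion_irregularity", "timestamp": 5.1, "region": "background"},
    ],
    "explanation": "Finger geometry deforms across frames; the background drifts non-rigidly.",
})
result = parse_response(raw)
print(result.is_ai_generated, len(result.artifacts))
```

Keeping the artifact list alongside the binary verdict is what makes the decision auditable: a reviewer can check each cited cue against the video rather than trusting an opaque score.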

arXiv lens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/skyra-ai-generated-video-detection-via-grounded-artifact-reasoning-3336-22fb155d

  • Executive Summary
  • Detailed Breakdown
  • Practical Applications

