Depth Anything 3: Recovering the Visual Space from Any Views Paper • 2511.10647 • Published 5 days ago • 58
Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning Paper • 2510.23473 • Published 22 days ago • 82
BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping Paper • 2510.18927 • Published 28 days ago • 82
Durian: Dual Reference-guided Portrait Animation with Attribute Transfer Paper • 2509.04434 • Published Sep 4 • 10
FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait Paper • 2412.01064 • Published Dec 2, 2024 • 47
TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis Paper • 2508.13618 • Published Aug 19 • 17
Gaze into the Heart: A Multi-View Video Dataset for rPPG and Health Biomarkers Estimation Paper • 2508.17924 • Published Aug 25 • 14
MIDAS: Multimodal Interactive Digital-human Synthesis via Real-time Autoregressive Video Generation Paper • 2508.19320 • Published Aug 26 • 29
FantasyTalking2: Timestep-Layer Adaptive Preference Optimization for Audio-Driven Portrait Animation Paper • 2508.11255 • Published Aug 15 • 11
DisTime: Distribution-based Time Representation for Video Large Language Models Paper • 2505.24329 • Published May 30 • 1
DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO Paper • 2506.07464 • Published Jun 9 • 13
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding Paper • 2507.13353 • Published Jul 17 • 1
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs Paper • 2506.21862 • Published Jun 27 • 36
DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework Paper • 2508.02807 • Published Aug 4 • 13
Phi-Ground Tech Report: Advancing Perception in GUI Grounding Paper • 2507.23779 • Published Jul 31 • 44
MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second Paper • 2507.10065 • Published Jul 14 • 24
SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation Paper • 2507.09862 • Published Jul 14 • 49
CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering Paper • 2507.08776 • Published Jul 11 • 54