Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos • Paper • 2501.04001 • Published Jan 7, 2025 • 47
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token • Paper • 2501.03895 • Published Jan 7, 2025 • 52
An Empirical Study of Autoregressive Pre-training from Videos • Paper • 2501.05453 • Published Jan 9, 2025 • 41
MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training • Paper • 2501.07556 • Published Jan 13, 2025 • 7
Video Depth Anything: Consistent Depth Estimation for Super-Long Videos • Paper • 2501.12375 • Published Jan 21, 2025 • 23
Intuitive physics understanding emerges from self-supervised pretraining on natural videos • Paper • 2502.11831 • Published Feb 17, 2025 • 20
DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks • Paper • 2502.17157 • Published Feb 24, 2025 • 52
"Principal Components" Enable A New Language of Images • Paper • 2503.08685 • Published Mar 11, 2025 • 12
What's in a Latent? Leveraging Diffusion Latent Space for Domain Generalization • Paper • 2503.06698 • Published Mar 9, 2025 • 4
FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution • Paper • 2510.12747 • Published Oct 14, 2025 • 39
VidVec: Unlocking Video MLLM Embeddings for Video-Text Retrieval • Paper • 2602.08099 • Published 15 days ago • 121