UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist Paper • 2511.08521 • Published 3 days ago • 22
Dr.V: A Hierarchical Perception-Temporal-Cognition Framework to Diagnose Video Hallucination by Fine-grained Spatial-Temporal Grounding Paper • 2509.11866 • Published Sep 15 • 1
VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models Paper • 2508.12081 • Published Aug 16
On Path to Multimodal Generalist: General-Level and General-Bench Paper • 2505.04620 • Published May 7 • 82
VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models Paper • 2504.13122 • Published Apr 17 • 20
JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization Paper • 2503.23377 • Published Mar 30 • 57
Any2Caption:Interpreting Any Condition to Caption for Controllable Video Generation Paper • 2503.24379 • Published Mar 31 • 76
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey Paper • 2503.12605 • Published Mar 16 • 35
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing Paper • 2412.19806 • Published Oct 8, 2024 • 2
Reasoning Implicit Sentiment with Chain-of-Thought Prompting Paper • 2305.11255 • Published May 18, 2023 • 2
MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter Paper • 2310.12798 • Published Oct 19, 2023 • 4
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning Paper • 2311.18651 • Published Nov 30, 2023
LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation Paper • 2308.05095 • Published Aug 9, 2023
Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models Paper • 2308.13812 • Published Aug 26, 2023 • 1
Faithful Logical Reasoning via Symbolic Chain-of-Thought Paper • 2405.18357 • Published May 28, 2024 • 2
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding Paper • 2406.19389 • Published Jun 27, 2024 • 54
PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis Paper • 2408.09481 • Published Aug 18, 2024 • 1
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning Paper • 2402.11435 • Published Feb 18, 2024