DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action Paper • 2511.22134 • Published Nov 27, 2025 • 21
Look-Back: Implicit Visual Re-focusing in MLLM Reasoning Paper • 2507.03019 • Published Jul 2, 2025 • 1
Can Understanding and Generation Truly Benefit Together -- or Just Coexist? Paper • 2509.09666 • Published Sep 11, 2025 • 34
FlashI2V: Fourier-Guided Latent Shifting Prevents Conditional Image Leakage in Image-to-Video Generation Paper • 2509.25187 • Published Sep 29, 2025 • 2
GIR-Bench: Versatile Benchmark for Generating Images with Reasoning Paper • 2510.11026 • Published Oct 13, 2025 • 17
Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback Paper • 2510.16888 • Published Oct 19, 2025 • 21
Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models Paper • 2510.01304 • Published Oct 1, 2025 • 10
Light-A-Video: Training-free Video Relighting via Progressive Light Fusion Paper • 2502.08590 • Published Feb 12, 2025 • 42
Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings Paper • 2506.04997 • Published Jun 5, 2025
ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing Paper • 2506.19848 • Published Jun 24, 2025 • 26
Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models Paper • 2508.00819 • Published Aug 1, 2025 • 62
Next Patch Prediction for Autoregressive Visual Generation Paper • 2412.15321 • Published Dec 19, 2024 • 1
DreamDance: Animating Human Images by Enriching 3D Geometry Cues from 2D Poses Paper • 2412.00397 • Published Nov 30, 2024
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation Paper • 2503.07265 • Published Mar 10, 2025 • 4
SwapAnyone: Consistent and Realistic Video Synthesis for Swapping Any Person into Any Video Paper • 2503.09154 • Published Mar 12, 2025
OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation Paper • 2505.20292 • Published May 26, 2025 • 52
ImgEdit: A Unified Image Editing Dataset and Benchmark Paper • 2505.20275 • Published May 26, 2025 • 18
UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation Paper • 2506.03147 • Published Jun 3, 2025 • 58
VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning Paper • 2505.22019 • Published May 28, 2025 • 11