Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning Paper • 2510.27606 • Published 10 days ago • 27
SPARK: Synergistic Policy And Reward Co-Evolving Framework Paper • 2509.22624 • Published Sep 26 • 17
CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning Paper • 2509.22647 • Published Sep 26 • 32
Hi3DEval: Advancing 3D Generation Evaluation with Hierarchical Validity Paper • 2508.05609 • Published Aug 7 • 29
ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing Paper • 2506.19848 • Published Jun 24 • 26
Light-A-Video: Training-free Video Relighting via Progressive Light Fusion Paper • 2502.08590 • Published Feb 12 • 43
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models Paper • 2410.17637 • Published Oct 23, 2024 • 36
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate Paper • 2410.07167 • Published Oct 9, 2024 • 39
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction Paper • 2410.17247 • Published Oct 22, 2024 • 47
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree Paper • 2410.16268 • Published Oct 21, 2024 • 69
Open-LLaVA-NeXT Collection Open-source implementation of the LLaVA-NeXT series with the Open-LLaVA-NeXT repository • 3 items • Updated May 29, 2024 • 4
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs Paper • 2406.11833 • Published Jun 17, 2024 • 63
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions Paper • 2406.04325 • Published Jun 6, 2024 • 75
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model Paper • 2401.16420 • Published Jan 29, 2024 • 55