OmniLayout: Enabling Coarse-to-Fine Learning with LLMs for Universal Document Layout Generation Paper • 2510.26213 • Published 17 days ago • 9
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing Paper • 2509.22186 • Published Sep 26 • 131
Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation Paper • 2508.09987 • Published Aug 13 • 25
EDGE-GRPO: Entropy-Driven GRPO with Guided Error Correction for Advantage Diversity Paper • 2507.21848 • Published Jul 29 • 8
The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs Paper • 2507.11097 • Published Jul 15 • 64
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence Paper • 2505.23747 • Published May 29 • 68
Shifting AI Efficiency From Model-Centric to Data-Centric Compression Paper • 2505.19147 • Published May 25 • 144
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning Paper • 2504.09641 • Published Apr 13 • 16
FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding Paper • 2504.09925 • Published Apr 14 • 38
GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation Paper • 2504.02782 • Published Apr 3 • 57
Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection with Artifact Explanation Paper • 2503.14905 • Published Mar 19 • 20
LEGION: Learning to Ground and Explain for Synthetic Image Detection Paper • 2503.15264 • Published Mar 19 • 21
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features Paper • 2502.14786 • Published Feb 20 • 154
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions Paper • 2412.09596 • Published Dec 12, 2024 • 98
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output Paper • 2407.03320 • Published Jul 3, 2024 • 95