MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity • Paper • 2511.03146 • Published 14 days ago • 7
Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts • Paper • 2511.04655 • Published 12 days ago • 7
Contamination Detection for VLMs using Multi-Modal Semantic Perturbation • Paper • 2511.03774 • Published 13 days ago • 12
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm • Paper • 2511.04570 • Published 12 days ago • 191
GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents • Paper • 2511.04307 • Published 12 days ago • 14
The Best of N Worlds: Aligning Reinforcement Learning with Best-of-N Sampling via max@k Optimisation • Paper • 2510.23393 • Published 22 days ago • 20
A Survey of Data Agents: Emerging Paradigm or Overstated Hype? • Paper • 2510.23587 • Published 22 days ago • 65
Sample By Step, Optimize By Chunk: Chunk-Level GRPO For Text-to-Image Generation • Paper • 2510.21583 • Published 25 days ago • 30
UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning • Paper • 2510.20286 • Published 26 days ago • 23
HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives • Paper • 2510.20822 • Published 26 days ago • 38
Beyond Correctness: Evaluating Subjective Writing Preferences Across Cultures • Paper • 2510.14616 • Published Oct 16 • 11