Sherlock: Self-Correcting Reasoning in Vision-Language Models Paper • 2505.22651 • Published May 28, 2025 • 50
OmniConsistency: Learning Style-Agnostic Consistency from Paired Stylization Data Paper • 2505.18445 • Published May 24, 2025 • 64
CreativeSynth: Creative Blending and Synthesis of Visual Arts based on Multimodal Diffusion Paper • 2401.14066 • Published Jan 25, 2024 • 11
ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models Paper • 2401.13311 • Published Jan 24, 2024 • 12
SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection Paper • 2401.13160 • Published Jan 24, 2024 • 13
Sketch2NeRF: Multi-view Sketch-guided Text-to-3D Generation Paper • 2401.14257 • Published Jan 25, 2024 • 12
UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion Paper • 2401.13388 • Published Jan 24, 2024 • 13
MaLA-500: Massive Language Adaptation of Large Language Models Paper • 2401.13303 • Published Jan 24, 2024 • 12
BootPIG: Bootstrapping Zero-shot Personalized Image Generation Capabilities in Pretrained Diffusion Models Paper • 2401.13974 • Published Jan 25, 2024 • 14
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models Paper • 2401.13919 • Published Jan 25, 2024 • 32
MM-LLMs: Recent Advances in MultiModal Large Language Models Paper • 2401.13601 • Published Jan 24, 2024 • 48
Operationalizing Contextual Integrity in Privacy-Conscious Assistants Paper • 2408.02373 • Published Aug 5, 2024 • 5