Multimodal LLM
DocLLM: A layout-aware generative language model for multimodal document
understanding
Paper
• 2401.00908
• Published
• 189
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved
Pre-Training
Paper
• 2401.00849
• Published
• 17
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Paper
• 2311.05437
• Published
• 51
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation,
Generation and Editing
Paper
• 2311.00571
• Published
• 43
LLaVA-φ: Efficient Multi-Modal Assistant with Small Language Model
Paper
• 2401.02330
• Published
• 18
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision,
Language, Audio, and Action
Paper
• 2312.17172
• Published
• 30
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
Paper
• 2206.08916
• Published
• 1
ImageBind: One Embedding Space To Bind Them All
Paper
• 2305.05665
• Published
• 6
Distilling Vision-Language Models on Millions of Videos
Paper
• 2401.06129
• Published
• 18
LEGO: Language Enhanced Multi-modal Grounding Model
Paper
• 2401.06071
• Published
• 12
Improving fine-grained understanding in image-text pre-training
Paper
• 2401.09865
• Published
• 18
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large
Language Models
Paper
• 2402.05935
• Published
• 17
ViGoR: Improving Visual Grounding of Large Vision Language Models with
Fine-Grained Reward Modeling
Paper
• 2402.06118
• Published
• 15
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned
Language Models
Paper
• 2402.07865
• Published
• 15
BBA: Bi-Modal Behavioral Alignment for Reasoning with Large
Vision-Language Models
Paper
• 2402.13577
• Published
• 9
A Touch, Vision, and Language Dataset for Multimodal Alignment
Paper
• 2402.13232
• Published
• 16
TinyLLaVA: A Framework of Small-scale Large Multimodal Models
Paper
• 2402.14289
• Published
• 20
Enhancing Vision-Language Pre-training with Rich Supervisions
Paper
• 2403.03346
• Published
• 17
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Paper
• 2403.11703
• Published
• 17
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal
Large Language Models
Paper
• 2403.13447
• Published
• 19
When Do We Not Need Larger Vision Models?
Paper
• 2403.13043
• Published
• 26
Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient
Inference
Paper
• 2403.14520
• Published
• 35
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual
Math Problems?
Paper
• 2403.14624
• Published
• 53
MoAI: Mixture of All Intelligence for Large Language and Vision Models
Paper
• 2403.07508
• Published
• 77
Mini-Gemini: Mining the Potential of Multi-modality Vision Language
Models
Paper
• 2403.18814
• Published
• 48
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model
Handling Resolutions from 336 Pixels to 4K HD
Paper
• 2404.06512
• Published
• 30
OmniFusion Technical Report
Paper
• 2404.06212
• Published
• 77
BLINK: Multimodal Large Language Models Can See but Not Perceive
Paper
• 2404.12390
• Published
• 26
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language
Models
Paper
• 2404.12387
• Published
• 39
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension
and Generation
Paper
• 2404.14396
• Published
• 19
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal
Models with Open-Source Suites
Paper
• 2404.16821
• Published
• 59
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with
Text-Rich Visual Comprehension
Paper
• 2404.16790
• Published
• 10
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video
Dense Captioning
Paper
• 2404.16994
• Published
• 37
What matters when building vision-language models?
Paper
• 2405.02246
• Published
• 103
An Introduction to Vision-Language Modeling
Paper
• 2405.17247
• Published
• 90
Matryoshka Multimodal Models
Paper
• 2405.17430
• Published
• 34
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal
Models
Paper
• 2405.15738
• Published
• 46
Needle In A Multimodal Haystack
Paper
• 2406.07230
• Published
• 54
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via
Chart-to-Code Generation
Paper
• 2406.09961
• Published
• 55
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images
Interleaved with Text
Paper
• 2406.08418
• Published
• 32
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal
Language Models
Paper
• 2406.09403
• Published
• 23
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
Paper
• 2406.08707
• Published
• 17
CVQA: Culturally-diverse Multilingual Visual Question Answering
Benchmark
Paper
• 2406.05967
• Published
• 6
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and
Instruction-Tuning Dataset for LVLMs
Paper
• 2406.11833
• Published
• 62
mDPO: Conditional Preference Optimization for Multimodal Large Language
Models
Paper
• 2406.11839
• Published
• 40
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal
Dataset with One Trillion Tokens
Paper
• 2406.11271
• Published
• 21
TokenPacker: Efficient Visual Projector for Multimodal LLM
Paper
• 2407.02392
• Published
• 23
Understanding Alignment in Multimodal LLMs: A Comprehensive Study
Paper
• 2407.02477
• Published
• 24
Vision language models are blind
Paper
• 2407.06581
• Published
• 84
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Paper
• 2406.16860
• Published
• 63
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for
Interleaved Image-Text Generation
Paper
• 2407.06135
• Published
• 22
MAVIS: Mathematical Visual Instruction Tuning
Paper
• 2407.08739
• Published
• 32
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning
Instruction Using Language Model
Paper
• 2407.07053
• Published
• 47
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large
Multimodal Models
Paper
• 2407.07895
• Published
• 42
SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers
Paper
• 2407.09413
• Published
• 11
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal
Large Language Model
Paper
• 2407.16198
• Published
• 13
VILA^2: VILA Augmented VILA
Paper
• 2407.17453
• Published
• 41
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Paper
• 2408.01800
• Published
• 92
VITA: Towards Open-Source Interactive Omni Multimodal LLM
Paper
• 2408.05211
• Published
• 50
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal
Large Language Models
Paper
• 2408.04840
• Published
• 33
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Paper
• 2408.08872
• Published
• 101
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Paper
• 2408.10188
• Published
• 52
Show-o: One Single Transformer to Unify Multimodal Understanding and
Generation
Paper
• 2408.12528
• Published
• 51
Open-FinLLMs: Open Multimodal Large Language Models for Financial
Applications
Paper
• 2408.11878
• Published
• 64
Building and better understanding vision-language models: insights and
future directions
Paper
• 2408.12637
• Published
• 133
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of
Encoders
Paper
• 2408.15998
• Published
• 86
CogVLM2: Visual Language Models for Image and Video Understanding
Paper
• 2408.16500
• Published
• 57
Law of Vision Representation in MLLMs
Paper
• 2408.16357
• Published
• 95
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
Paper
• 2408.15881
• Published
• 21
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via
Hybrid Architecture
Paper
• 2409.02889
• Published
• 54
mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page
Document Understanding
Paper
• 2409.03420
• Published
• 26
NVLM: Open Frontier-Class Multimodal LLMs
Paper
• 2409.11402
• Published
• 74
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at
Any Resolution
Paper
• 2409.12191
• Published
• 78
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary
Resolution
Paper
• 2409.12961
• Published
• 25
Phantom of Latent for Large Language and Vision Models
Paper
• 2409.14713
• Published
• 29
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art
Multimodal Models
Paper
• 2409.17146
• Published
• 121
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with
3D-awareness
Paper
• 2409.18125
• Published
• 34
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Paper
• 2409.20566
• Published
• 55
MIO: A Foundation Model on Multimodal Tokens
Paper
• 2409.17692
• Published
• 53
Emu3: Next-Token Prediction is All You Need
Paper
• 2409.18869
• Published
• 97
LLaVA-Critic: Learning to Evaluate Multimodal Models
Paper
• 2410.02712
• Published
• 37
LEOPARD: A Vision Language Model For Text-Rich Multi-Image Tasks
Paper
• 2410.01744
• Published
• 27
TLDR: Token-Level Detective Reward Model for Large Vision Language
Models
Paper
• 2410.04734
• Published
• 18
MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation
Paper
• 2410.11779
• Published
• 26
Baichuan-Omni Technical Report
Paper
• 2410.08565
• Published
• 87
From Generalist to Specialist: Adapting Vision Language Models via
Task-Specific Visual Instruction Tuning
Paper
• 2410.06456
• Published
• 37
Aria: An Open Multimodal Native Mixture-of-Experts Model
Paper
• 2410.05993
• Published
• 111
Pixtral 12B
Paper
• 2410.07073
• Published
• 69
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding
and Generation
Paper
• 2410.13848
• Published
• 35
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid
Visual Redundancy Reduction
Paper
• 2410.17247
• Published
• 47
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
Paper
• 2410.13861
• Published
• 56
Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages
Paper
• 2410.16153
• Published
• 44
DM-Codec: Distilling Multimodal Representations for Speech Tokenization
Paper
• 2410.15017
• Published
• 2
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex
Capabilities
Paper
• 2410.11190
• Published
• 22
Distill Visual Chart Reasoning Ability from LLMs to MLLMs
Paper
• 2410.18798
• Published
• 21
WAFFLE: Multi-Modal Model for Automated Front-End Development
Paper
• 2410.18362
• Published
• 13
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language
Tuning
Paper
• 2410.17779
• Published
• 8
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language
Understanding
Paper
• 2410.17434
• Published
• 27
Document Parsing Unveiled: Techniques, Challenges, and Prospects for
Structured Information Extraction
Paper
• 2410.21169
• Published
• 30
GPT-4o System Card
Paper
• 2410.21276
• Published
• 87
Infinity-MM: Scaling Multimodal Performance with Large-Scale and
High-Quality Instruction Data
Paper
• 2410.18558
• Published
• 18
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
Paper
• 2410.23218
• Published
• 49
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in
Videos
Paper
• 2411.04923
• Published
• 23
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
Paper
• 2411.10440
• Published
• 129
Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of
Experts
Paper
• 2411.10669
• Published
• 10
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large
Language Models on Mobile Devices
Paper
• 2411.10640
• Published
• 46
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Paper
• 2411.17465
• Published
• 89
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video
Comprehension with Video-Text Duet Interaction Format
Paper
• 2411.17991
• Published
• 5
Ovis: Structural Embedding Alignment for Multimodal Large Language Model
Paper
• 2405.20797
• Published
• 32
X-Prompt: Towards Universal In-Context Image Generation in
Auto-Regressive Vision Language Foundation Models
Paper
• 2412.01824
• Published
• 64
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding
by Video Spatiotemporal Augmentation
Paper
• 2412.00927
• Published
• 29
On Domain-Specific Post-Training for Multimodal Large Language Models
Paper
• 2411.19930
• Published
• 31
LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual
Preferences
Paper
• 2412.01292
• Published
• 13
PaliGemma 2: A Family of Versatile VLMs for Transfer
Paper
• 2412.03555
• Published
• 133
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and
Generation
Paper
• 2412.03069
• Published
• 34
Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene
Understanding
Paper
• 2412.00493
• Published
• 17
VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models
Paper
• 2411.19103
• Published
• 21
Expanding Performance Boundaries of Open-Source Multimodal Models with
Model, Data, and Test-Time Scaling
Paper
• 2412.05271
• Published
• 160
NVILA: Efficient Frontier Visual Language Models
Paper
• 2412.04468
• Published
• 60
Florence-VL: Enhancing Vision-Language Models with Generative Vision
Encoder and Depth-Breadth Fusion
Paper
• 2412.04424
• Published
• 62
POINTS1.5: Building a Vision-Language Model towards Real World
Applications
Paper
• 2412.08443
• Published
• 38
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity
Visual Descriptions
Paper
• 2412.08737
• Published
• 54
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Paper
• 2412.10360
• Published
• 147
BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities
Paper
• 2412.07769
• Published
• 30
Multimodal Latent Language Modeling with Next-Token Diffusion
Paper
• 2412.08635
• Published
• 49
OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary
Embedding Distillation
Paper
• 2412.09585
• Published
• 11
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced
Multimodal Understanding
Paper
• 2412.10302
• Published
• 22
SynerGen-VL: Towards Synergistic Image Understanding and Generation with
Vision Experts and Token Folding
Paper
• 2412.09604
• Published
• 38
Emma-X: An Embodied Multimodal Action Model with Grounded Chain of
Thought and Look-ahead Spatial Reasoning
Paper
• 2412.11974
• Published
• 10
LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via
Hierarchical Window Transformer
Paper
• 2412.13871
• Published
• 18
Descriptive Caption Enhancement with Visual Specialists for Multimodal
Perception
Paper
• 2412.14233
• Published
• 6
Diving into Self-Evolving Training for Multimodal Reasoning
Paper
• 2412.17451
• Published
• 42
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via
Collective Monte Carlo Tree Search
Paper
• 2412.18319
• Published
• 39
Video-Panda: Parameter-efficient Alignment for Encoder-free
Video-Language Models
Paper
• 2412.18609
• Published
• 17
Molar: Multimodal LLMs with Collaborative Filtering Alignment for
Enhanced Sequential Recommendation
Paper
• 2412.18176
• Published
• 16
Explanatory Instructions: Towards Unified Vision Tasks Understanding and
Zero-shot Generalization
Paper
• 2412.18525
• Published
• 74
On the Compositional Generalization of Multimodal LLMs for Medical
Imaging
Paper
• 2412.20070
• Published
• 42
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
Paper
• 2412.18619
• Published
• 60
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Paper
• 2501.01957
• Published
• 47
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
Paper
• 2501.01904
• Published
• 33
STAR: Spatial-Temporal Augmentation with Text-to-Video Models for
Real-World Video Super-Resolution
Paper
• 2501.02976
• Published
• 56
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One
Vision Token
Paper
• 2501.03895
• Published
• 52
URSA: Understanding and Verifying Chain-of-thought Reasoning in
Multimodal Mathematics
Paper
• 2501.04686
• Published
• 53
An Empirical Study of Autoregressive Pre-training from Videos
Paper
• 2501.05453
• Published
• 41
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
Paper
• 2501.06186
• Published
• 65
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
Paper
• 2501.06282
• Published
• 53
A Multi-Modal AI Copilot for Single-Cell Analysis with Instruction
Following
Paper
• 2501.08187
• Published
• 27
Tarsier2: Advancing Large Vision-Language Models from Detailed Video
Description to Comprehensive Video Understanding
Paper
• 2501.07888
• Published
• 15
Parameter-Inverted Image Pyramid Networks for Visual Perception and
Multimodal Understanding
Paper
• 2501.07783
• Published
• 8
Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
Paper
• 2501.09012
• Published
• 10
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token
Marks
Paper
• 2501.08326
• Published
• 34
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Paper
• 2501.09747
• Published
• 28
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
Paper
• 2501.11733
• Published
• 28
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video
Understanding
Paper
• 2501.13106
• Published
• 90
Baichuan-Omni-1.5 Technical Report
Paper
• 2501.15368
• Published
• 60
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal
Understanding
Paper
• 2502.01341
• Published
• 39
MetaMorph: Multimodal Understanding and Generation via Instruction
Tuning
Paper
• 2412.14164
• Published
• 4
Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive
Modality Alignment
Paper
• 2502.04328
• Published
• 29
VideoRoPE: What Makes for Good Video Rotary Position Embedding?
Paper
• 2502.05173
• Published
• 64
Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and
Generation
Paper
• 2502.05415
• Published
• 20
EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
Paper
• 2502.06788
• Published
• 13
Éclair -- Extracting Content and Layout with Integrated Reading Order
for Documents
Paper
• 2502.04223
• Published
• 10
HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and
Generation
Paper
• 2502.12148
• Published
• 17
Soundwave: Less is More for Speech-Text Alignment in LLMs
Paper
• 2502.12900
• Published
• 86
Magma: A Foundation Model for Multimodal AI Agents
Paper
• 2502.13130
• Published
• 58
HealthGPT: A Medical Large Vision-Language Model for Unifying
Comprehension and Generation via Heterogeneous Knowledge Adaptation
Paper
• 2502.09838
• Published
• 11
Qwen2.5-VL Technical Report
Paper
• 2502.13923
• Published
• 214
UniTok: A Unified Tokenizer for Visual Generation and Understanding
Paper
• 2502.20321
• Published
• 30
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language
Models via Mixture-of-LoRAs
Paper
• 2503.01743
• Published
• 89
Token-Efficient Long Video Understanding for Multimodal LLMs
Paper
• 2503.04130
• Published
• 96
Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding
and Expert Reasoning Abilities
Paper
• 2503.03983
• Published
• 27
VisualWebInstruct: Scaling up Multimodal Instruction Data through Web
Search
Paper
• 2503.10582
• Published
• 24
ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model
with Interleaved Multimodal Generation via Asymmetric Synergy
Paper
• 2503.06542
• Published
• 7
Being-0: A Humanoid Robotic Agent with Vision-Language Models and
Modular Skills
Paper
• 2503.12533
• Published
• 68
JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play
Visual Games with Keyboards and Mouse
Paper
• 2503.16365
• Published
• 41
CoMP: Continual Multimodal Pre-training for Vision Foundation Models
Paper
• 2503.18931
• Published
• 30
Scaling Vision Pre-Training to 4K Resolution
Paper
• 2503.19903
• Published
• 41
Qwen2.5-Omni Technical Report
Paper
• 2503.20215
• Published
• 170
Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal
LLMs on Academic Resources
Paper
• 2504.00595
• Published
• 37
Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features
Paper
• 2504.00557
• Published
• 15
AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models
with Unsupervised Coefficient Optimization
Paper
• 2503.23733
• Published
• 10
ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and
Diffusion Refinement
Paper
• 2504.01934
• Published
• 22
Scaling Analysis of Interleaved Speech-Text Language Models
Paper
• 2504.02398
• Published
• 31
ShortV: Efficient Multimodal Large Language Models by Freezing Visual
Tokens in Ineffective Layers
Paper
• 2504.00502
• Published
• 26
Slow-Fast Architecture for Video Multi-Modal Large Language Models
Paper
• 2504.01328
• Published
• 7
SmolVLM: Redefining small and efficient multimodal models
Paper
• 2504.05299
• Published
• 205
Kimi-VL Technical Report
Paper
• 2504.07491
• Published
• 137
InternVL3: Exploring Advanced Training and Test-Time Recipes for
Open-Source Multimodal Models
Paper
• 2504.10479
• Published
• 306
FUSION: Fully Integration of Vision-Language Representations for Deep
Cross-Modal Understanding
Paper
• 2504.09925
• Published
• 39
Mavors: Multi-granularity Video Representation for Multimodal Large
Language Model
Paper
• 2504.10068
• Published
• 30
Eagle 2.5: Boosting Long-Context Post-Training for Frontier
Vision-Language Models
Paper
• 2504.15271
• Published
• 67
An LMM for Efficient Video Understanding via Reinforced Compression of
Video Cubes
Paper
• 2504.15270
• Published
• 9
Vidi: Large Multimodal Models for Video Understanding and Editing
Paper
• 2504.15681
• Published
• 14
MR. Video: "MapReduce" is the Principle for Long Video Understanding
Paper
• 2504.16082
• Published
• 5
Breaking the Modality Barrier: Universal Embedding Learning with
Multimodal LLMs
Paper
• 2504.17432
• Published
• 40
Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery
Simulation
Paper
• 2504.17207
• Published
• 30
DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs
Paper
• 2504.17040
• Published
• 13
Kimi-Audio Technical Report
Paper
• 2504.18425
• Published
• 20
MMInference: Accelerating Pre-filling for Long-Context VLMs via
Modality-Aware Permutation Sparse Attention
Paper
• 2504.16083
• Published
• 8
YoChameleon: Personalized Vision and Language Generation
Paper
• 2504.20998
• Published
• 12
UniBiomed: A Universal Foundation Model for Grounded Biomedical Image
Interpretation
Paper
• 2504.21336
• Published
• 5
Voila: Voice-Language Foundation Models for Real-Time Autonomous
Interaction and Voice Role-Play
Paper
• 2505.02707
• Published
• 85
LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive
Streaming Speech Synthesis
Paper
• 2505.02625
• Published
• 23
Unlearning Sensitive Information in Multimodal LLMs: Benchmark and
Attack-Defense Evaluation
Paper
• 2505.01456
• Published
• 2
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient
Large Speech-Language Model
Paper
• 2505.03739
• Published
• 9
Unified Multimodal Understanding and Generation Models: Advances,
Challenges, and Opportunities
Paper
• 2505.02567
• Published
• 80
On Path to Multimodal Generalist: General-Level and General-Bench
Paper
• 2505.04620
• Published
• 82
StreamBridge: Turning Your Offline Video Large Language Model into a
Proactive Streaming Assistant
Paper
• 2505.05467
• Published
• 13
Seed1.5-VL Technical Report
Paper
• 2505.07062
• Published
• 155
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture,
Training and Dataset
Paper
• 2505.09568
• Published
• 99
Aya Vision: Advancing the Frontier of Multilingual Multimodality
Paper
• 2505.08751
• Published
• 13
Bring Reason to Vision: Understanding Perception and Reasoning through
Model Merging
Paper
• 2505.05464
• Published
• 11
Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?
Paper
• 2505.09439
• Published
• 10
End-to-End Vision Tokenizer Tuning
Paper
• 2505.10562
• Published
• 22
FastVLM: Efficient Vision Encoding for Vision Language Models
Paper
• 2412.13303
• Published
• 75
Emerging Properties in Unified Multimodal Pretraining
Paper
• 2505.14683
• Published
• 133
QuickVideo: Real-Time Long Video Understanding with System Algorithm
Co-Design
Paper
• 2505.16175
• Published
• 42
Backdoor Cleaning without External Guidance in MLLM Fine-tuning
Paper
• 2505.16916
• Published
• 17
Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal
Large Language Models
Paper
• 2505.17015
• Published
• 9
HoliTom: Holistic Token Merging for Fast Video Large Language Models
Paper
• 2505.21334
• Published
• 21
MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware
Multi-Segment Grounding
Paper
• 2505.20715
• Published
• 2
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial
Intelligence
Paper
• 2505.23747
• Published
• 69
ZeroGUI: Automating Online GUI Learning at Zero Human Cost
Paper
• 2505.23762
• Published
• 45
VidText: Towards Comprehensive Evaluation for Video Text Understanding
Paper
• 2505.22810
• Published
• 19
TokBench: Evaluating Your Visual Tokenizer before Visual Generation
Paper
• 2505.18142
• Published
• 2
Don't Look Only Once: Towards Multimodal Interactive Reasoning with
Selective Visual Revisitation
Paper
• 2505.18842
• Published
• 36
Fork-Merge Decoding: Enhancing Multimodal Understanding in Audio-Visual
Large Language Models
Paper
• 2505.20873
• Published
• 9
ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and
Understanding
Paper
• 2506.01853
• Published
• 32
OmniResponse: Online Multimodal Conversational Response Generation in
Dyadic Interactions
Paper
• 2505.21724
• Published
• 5
Aligning VLM Assistants with Personalized Situated Cognition
Paper
• 2506.00930
• Published
• 2
MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech
Paralinguistic and Affect Labeling
Paper
• 2505.15772
• Published
• 3
Visual Embodied Brain: Let Multimodal Large Language Models See, Think,
and Control in Spaces
Paper
• 2506.00123
• Published
• 35
Paper
• 2506.03569
• Published
• 80
SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs
Paper
• 2506.05344
• Published
• 17
FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal
Contextual Fusion
Paper
• 2506.01111
• Published
• 31
Is Extending Modality The Right Path Towards Omni-Modality?
Paper
• 2506.01872
• Published
• 24
Lingshu: A Generalist Foundation Model for Unified Multimodal Medical
Understanding and Reasoning
Paper
• 2506.07044
• Published
• 113
MIRAGE: Multimodal foundation model and benchmark for comprehensive
retinal OCT image analysis
Paper
• 2506.08900
• Published
• 4
Ming-Omni: A Unified Multimodal Model for Perception and Generation
Paper
• 2506.09344
• Published
• 31
Stream-Omni: Simultaneous Multimodal Interactions with Large
Language-Vision-Speech Model
Paper
• 2506.13642
• Published
• 27
VideoMolmo: Spatio-Temporal Grounding Meets Pointing
Paper
• 2506.05336
• Published
• 9
GenRecal: Generation after Recalibration from Large to Small
Vision-Language Models
Paper
• 2506.15681
• Published
• 42
CoMemo: LVLMs Need Image Context with Image Memory
Paper
• 2506.06279
• Published
• 8
MoTE: Mixture of Ternary Experts for Memory-efficient Large Multimodal
Models
Paper
• 2506.14435
• Published
• 7
FedNano: Toward Lightweight Federated Tuning for Pretrained Multimodal
Large Language Models
Paper
• 2506.14824
• Published
• 8
Show-o2: Improved Native Unified Multimodal Models
Paper
• 2506.15564
• Published
• 29
UniFork: Exploring Modality Alignment for Unified Multimodal
Understanding and Generation
Paper
• 2506.17202
• Published
• 10
InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video
Understanding
Paper
• 2506.15745
• Published
• 14
OmniGen2: Exploration to Advanced Multimodal Generation
Paper
• 2506.18871
• Published
• 78
Vision as a Dialect: Unifying Visual Understanding and Generation via
Text-Aligned Representations
Paper
• 2506.18898
• Published
• 34
ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image
Generation
Paper
• 2506.18095
• Published
• 66
LLaVA-Scissor: Token Compression with Semantic Connected Components for
Video LLMs
Paper
• 2506.21862
• Published
• 36
ShotBench: Expert-Level Cinematic Understanding in Vision-Language
Models
Paper
• 2506.21356
• Published
• 22
Paper
• 2506.23044
• Published
• 61
UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence
with Spatial Reasoning and Understanding
Paper
• 2506.23219
• Published
• 7
Kwai Keye-VL Technical Report
Paper
• 2507.01949
• Published
• 131
μ^2Tokenizer: Differentiable Multi-Scale Multi-Modal Tokenizer for
Radiology Report Generation
Paper
• 2507.00316
• Published
• 15
MARVIS: Modality Adaptive Reasoning over VISualizations
Paper
• 2507.01544
• Published
• 13
Scaling RL to Long Videos
Paper
• 2507.07966
• Published
• 160
Multi-Granular Spatio-Temporal Token Merging for Training-Free
Acceleration of Video LLMs
Paper
• 2507.07990
• Published
• 46
Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal
Large Language Models
Paper
• 2507.12566
• Published
• 15
Paper
• 2507.13264
• Published
• 32
Pixels, Patterns, but No Poetry: To See The World like Humans
Paper
• 2507.16863
• Published
• 69
Region-based Cluster Discrimination for Visual Representation Learning
Paper
• 2507.20025
• Published
• 19
X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image
Generative Models Great Again
Paper
• 2507.22058
• Published
• 40
Phi-Ground Tech Report: Advancing Perception in GUI Grounding
Paper
• 2507.23779
• Published
• 45
Qwen-Image Technical Report
Paper
• 2508.02324
• Published
• 272
VeOmni: Scaling Any Modality Model Training with Model-Centric
Distributed Recipe Zoo
Paper
• 2508.02317
• Published
• 22
MELLA: Bridging Linguistic Capability and Cultural Groundedness for
Low-Resource Language MLLMs
Paper
• 2508.05502
• Published
• 6
UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and
Precise Inference-Time Grounding
Paper
• 2507.22025
• Published
• 4
Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding
and Generation
Paper
• 2508.03320
• Published
• 63
LongVie: Multimodal-Guided Controllable Ultra-Long Video Generation
Paper
• 2508.03694
• Published
• 52
Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with
Patch-level CLIP Latents
Paper
• 2508.05954
• Published
• 6
Paper
• 2508.11737
• Published
• 112
Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision
Mapping
Paper
• 2508.12466
• Published
• 8
WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
Paper
• 2508.05748
• Published
• 141
Intern-S1: A Scientific Multimodal Foundation Model
Paper
• 2508.15763
• Published
• 269
LLaSO: A Foundational Framework for Reproducible Research in Large
Language and Speech Model
Paper
• 2508.15418
• Published
• 8
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility,
Reasoning, and Efficiency
Paper
• 2508.18265
• Published
• 214
MIDAS: Multimodal Interactive Digital-human Synthesis via Real-time
Autoregressive Video Generation
Paper
• 2508.19320
• Published
• 29
VibeVoice Technical Report
Paper
• 2508.19205
• Published
• 143
POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models
for Document Conversion
Paper
• 2509.01215
• Published
• 51
Kwai Keye-VL 1.5 Technical Report
Paper
• 2509.01563
• Published
• 38
Visual Programmability: A Guide for Code-as-Thought in Chart
Understanding
Paper
• 2509.09286
• Published
• 11
Curia: A Multi-Modal Foundation Model for Radiology
Paper
• 2509.06830
• Published
• 21
UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning
Paper
• 2509.11543
• Published
• 49
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid
Vision Tokenizer
Paper
• 2509.16197
• Published
• 58
AToken: A Unified Tokenizer for Vision
Paper
• 2509.14476
• Published
• 36
SAIL-VL2 Technical Report
Paper
• 2509.14033
• Published
• 44
Qwen3-Omni Technical Report
Paper
• 2509.17765
• Published
• 149
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and
Training Recipe
Paper
• 2509.18154
• Published
• 55
Seedream 4.0: Toward Next-generation Multimodal Image Generation
Paper
• 2509.20427
• Published
• 82
UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models
Paper
• 2509.21760
• Published
• 15
MinerU2.5: A Decoupled Vision-Language Model for Efficient
High-Resolution Document Parsing
Paper
• 2509.22186
• Published
• 146
CHURRO: Making History Readable with an Open-Weight Large
Vision-Language Model for High-Accuracy, Low-Cost Historical Text Recognition
Paper
• 2509.19768
• Published
• 7
MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech
Paper
• 2509.25131
• Published
• 16
HunyuanImage 3.0 Technical Report
Paper
• 2509.23951
• Published
• 25
Paper
• 2510.01141
• Published
• 121
UniVideo: Unified Understanding, Generation, and Editing for Videos
Paper
• 2510.08377
• Published
• 81
NaViL: Rethinking Scaling Properties of Native Multimodal Large Language
Models under Data Constraints
Paper
• 2510.08565
• Published
• 21
InstructX: Towards Unified Visual Editing with MLLM Guidance
Paper
• 2510.08485
• Published
• 18
Ming-UniVision: Joint Image Understanding and Generation with a Unified
Continuous Tokenizer
Paper
• 2510.06590
• Published
• 77
Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in
MLLMs
Paper
• 2510.01954
• Published
• 14
Thinking with Camera: A Unified Multimodal Model for Camera-Centric
Understanding and Generation
Paper
• 2510.08673
• Published
• 126
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
Paper
• 2510.14528
• Published
• 118
InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn
Dialogue
Paper
• 2510.13747
• Published
• 30
Scaling Language-Centric Omnimodal Representation Learning
Paper
• 2510.11693
• Published
• 104
Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware
Finetuning and MLLM Implicit Feedback
Paper
• 2510.16888
• Published
• 22
OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding
LLM
Paper
• 2510.15870
• Published
• 91
From Pixels to Words -- Towards Native Vision-Language Primitives at
Scale
Paper
• 2510.14979
• Published
• 67
olmOCR 2: Unit Test Rewards for Document OCR
Paper
• 2510.19817
• Published
• 16
DeepSeek-OCR: Contexts Optical Compression
Paper
• 2510.18234
• Published
• 93
Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs
Paper
• 2510.13251
• Published
• 14
Emu3.5: Native Multimodal Models are World Learners
Paper
• 2510.26583
• Published
• 111
JanusCoder: Towards a Foundational Visual-Programmatic Interface for
Code Intelligence
Paper
• 2510.23538
• Published
• 98
PairUni: Pairwise Training for Unified Multimodal Language Models
Paper
• 2510.25682
• Published
• 14
NVIDIA Nemotron Nano V2 VL
Paper
• 2511.03929
• Published
• 30
Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large
Language Models
Paper
• 2511.07253
• Published
• 3
Music Flamingo: Scaling Music Understanding in Audio Language Models
Paper
• 2511.10289
• Published
• 17
Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data
Paper
• 2511.12609
• Published
• 105
Part-X-MLLM: Part-aware 3D Multimodal Large Language Model
Paper
• 2511.13647
• Published
• 71
UI2Code^N: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation
Paper
• 2511.08195
• Published
• 34
Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution
Paper
• 2511.14210
• Published
• 21
Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
Paper
• 2511.14993
• Published
• 231
OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models
Paper
• 2511.14582
• Published
• 19
Scaling Spatial Intelligence with Multimodal Foundation Models
Paper
• 2511.13719
• Published
• 47
HunyuanVideo 1.5 Technical Report
Paper
• 2511.18870
• Published
• 28
Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination
Paper
• 2511.17490
• Published
• 22
VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models
Paper
• 2511.11007
• Published
• 15
NVIDIA Nemotron Parse 1.1
Paper
• 2511.20478
• Published
• 23
UniGame: Turning a Unified Multimodal Model Into Its Own Adversary
Paper
• 2511.19413
• Published
• 20
G^2VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
Paper
• 2511.21688
• Published
• 8
HunyuanOCR Technical Report
Paper
• 2511.19575
• Published
• 22
OralGPT-Omni: A Versatile Dental Multimodal Large Language Model
Paper
• 2511.22055
• Published
• 8
InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision
Paper
• 2512.01342
• Published
• 18
OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion
Paper
• 2512.00234
• Published
• 1
TV2TV: A Unified Framework for Interleaved Language and Video Generation
Paper
• 2512.05103
• Published
• 20
Qwen3-VL Technical Report
Paper
• 2511.21631
• Published
• 158
Jina-VLM: Small Multilingual Vision Language Model
Paper
• 2512.04032
• Published
• 15
AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
Paper
• 2512.03794
• Published
• 5
EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture
Paper
• 2512.04810
• Published
• 26
From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model
Paper
• 2512.05277
• Published
• 6
InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
Paper
• 2512.08829
• Published
• 21
Kling-Omni Technical Report
Paper
• 2512.16776
• Published
• 170
Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model
Paper
• 2512.13507
• Published
• 40
HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices
Paper
• 2512.14052
• Published
• 42
TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
Paper
• 2512.14698
• Published
• 21
JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation
Paper
• 2512.22905
• Published
• 20
LTX-2: Efficient Joint Audio-Visual Foundation Model
Paper
• 2601.03233
• Published
• 154
UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision
Paper
• 2601.03193
• Published
• 47
UM-Text: A Unified Multimodal Model for Image Understanding
Paper
• 2601.08321
• Published
• 10
VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding
Paper
• 2601.07290
• Published
• 7
STEP3-VL-10B Technical Report
Paper
• 2601.09668
• Published
• 193
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Paper
• 2601.10611
• Published
• 29
FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs
Paper
• 2601.13836
• Published
• 35
LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR
Paper
• 2601.14251
• Published
• 24
GutenOCR: A Grounded Vision-Language Front-End for Documents
Paper
• 2601.14490
• Published
• 37
OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation
Paper
• 2601.15369
• Published
• 21
OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models
Paper
• 2601.21639
• Published
• 50
Qwen3-ASR Technical Report
Paper
• 2601.21337
• Published
• 36
Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation
Paper
• 2601.21406
• Published
• 5
Less Is More -- Until It Breaks: Security Pitfalls of Vision Token Compression in Large Vision-Language Models
Paper
• 2601.12042
• Published
• 2
Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision
Paper
• 2601.19798
• Published
• 42
Innovator-VL: A Multimodal Large Language Model for Scientific Discovery
Paper
• 2601.19325
• Published
• 79
DeepSeek-OCR 2: Visual Causal Flow
Paper
• 2601.20552
• Published
• 63
PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing
Paper
• 2601.21957
• Published
• 19
Kimi K2.5: Visual Agentic Intelligence
Paper
• 2602.02276
• Published
• 250
CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding
Paper
• 2602.01785
• Published
• 94
ERNIE 5.0 Technical Report
Paper
• 2602.04705
• Published
• 260
OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models
Paper
• 2602.04804
• Published
• 46