Vision Language Models: 2025 Update

sergiopaniego's Collections

Updated May 12, 2025

This collection includes all the models, datasets and Spaces mentioned in the blog post "Vision Language Models: 2025 Update".

Qwen/Qwen2.5-Omni-7B

Any-to-Any • Updated Apr 30, 2025 • 361k • 1.87k
🏆 Qwen2.5 Omni 7B Demo

Space • Running • Featured • 366 • Chat with AI using text, audio, images, and video
Qwen2.5-Omni Technical Report

Paper • 2503.20215 • Published Mar 26, 2025 • 170
openbmb/MiniCPM-o-2_6

Any-to-Any • Updated Oct 5, 2025 • 106k • 1.28k
deepseek-ai/Janus-Pro-7B

Any-to-Any • Updated Feb 1, 2025 • 28k • 3.57k
🌍 Chat With Janus-Pro-7B

Space • Runtime error • Featured • 2.02k • A unified multimodal understanding and generation model.
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Paper • 2501.17811 • Published Jan 29, 2025 • 8
Qwen/QVQ-72B-Preview

Image-Text-to-Text • 73B • Updated Jan 12, 2025 • 418 • 609
moonshotai/Kimi-VL-A3B-Thinking

Image-Text-to-Text • 16B • Updated 26 days ago • 75.8k • 446
🤔 Chat with Kimi-VL-A3B-Thinking-2506

Space • Running on Zero • Featured • 194 • Chat with Kimi-VL: respond to text, images, video, PDFs
Kimi-VL Technical Report

Paper • 2504.07491 • Published Apr 10, 2025 • 137
moonshotai/MoonViT-SO-400M

Image Feature Extraction • 0.4B • Updated Apr 17, 2025 • 933 • 36
google/siglip-so400m-patch14-384

Zero-Shot Image Classification • 0.9B • Updated Sep 26, 2024 • 1.99M • 654
moonshotai/Kimi-VL-A3B-Instruct

Image-Text-to-Text • 16B • Updated 26 days ago • 179k • 256
HuggingFaceTB/SmolVLM-Instruct

Image-Text-to-Text • 2B • Updated Apr 8, 2025 • 51.9k • 577
📊 SmolVLM

Space • Runtime error • 144 • Generate text from images and queries
HuggingFaceTB/SmolVLM2-2.2B-Instruct

Image-Text-to-Text • Updated Apr 8, 2025 • 120k • 303
SmolVLM: Redefining small and efficient multimodal models

Paper • 2504.05299 • Published Apr 7, 2025 • 205
📊 SmolVLM

Space • Build error • 81 • Generate answers by combining text and images
google/gemma-3-27b-it

Image-Text-to-Text • Updated Mar 21, 2025 • 1.66M • 1.9k
unsloth/gemma-3-27b-it-GGUF

Image-Text-to-Text • 27B • Updated Aug 14, 2025 • 167k • 188
google/gemma-3-27b-it-qat-q4_0-gguf

Image-Text-to-Text • 27B • Updated Apr 11, 2025 • 10k • 380
meta-llama/Llama-4-Scout-17B-16E-Instruct

Image-Text-to-Text • Updated May 22, 2025 • 200k • 1.22k
meta-llama/Llama-4-Maverick-17B-128E-Instruct

Image-Text-to-Text • Updated May 22, 2025 • 9.36k • 465
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

Paper • 2401.15947 • Published Jan 29, 2024 • 53
deepseek-ai/deepseek-vl2

Image-Text-to-Text • Updated Dec 18, 2024 • 4.27k • 380
🌍 Chat with DeepSeek-VL2-small

Space • Running on Zero • Featured • 587 • Chat with images and text using AI assistant
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Paper • 2412.10302 • Published Dec 13, 2024 • 22
lerobot/pi0_old

Robotics • 4B • Updated Sep 19, 2025 • 650 • 307
nvidia/GR00T-N1-2B

Robotics • 2B • Updated Sep 2, 2025 • 244 • 348
google/paligemma-3b-pt-224

Image-Text-to-Text • Updated Sep 21, 2024 • 42.7k • 415
PaliGemma: A versatile 3B VLM for transfer

Paper • 2407.07726 • Published Jul 10, 2024 • 72
🤲 PaliGemma Demo

Space • Paused • Featured • 314 • Annotate and describe images with text prompts
PaliGemma 2: A Family of Versatile VLMs for Transfer

Paper • 2412.03555 • Published Dec 4, 2024 • 133
🌖 Paligemma2 Mix

Space • Runtime error • 96 • Generate text and segment images using PaliGemma 2
google/paligemma2-10b-mix-448

Image-Text-to-Text • Updated Feb 7, 2025 • 1.19k • 35
allenai/Molmo-72B-0924

Image-Text-to-Text • 73B • Updated Oct 9, 2025 • 6.69k • 296
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

Paper • 2409.17146 • Published Sep 25, 2024 • 121
Qwen/Qwen2.5-VL-72B-Instruct

Image-Text-to-Text • Updated Jun 6, 2025 • 264k • 596
Qwen2.5-VL Technical Report

Paper • 2502.13923 • Published Feb 19, 2025 • 213
google/shieldgemma-2-4b-it

Image-Text-to-Text • Updated Apr 4, 2025 • 8.1k • 147
ShieldGemma 2: Robust and Tractable Image Content Moderation

Paper • 2504.01081 • Published Apr 1, 2025 • 3
📉 ShieldGemma2 VLM

Space • Runtime error • 12 • Demo for ShieldGemma 2, multimodal safety model
meta-llama/Llama-Guard-4-12B

Image-Text-to-Text • Updated Apr 29, 2025 • 226k • 81
🦀 Llama Guard 4

Space • Runtime error • 1 • Check if text and images are safe
marco/mcdse-2b-v1

2B • Updated Oct 29, 2024 • 2.79k • 56
vidore/colpali-v1.3

Visual Document Retrieval • Updated Mar 14, 2025 • 34.8k • 88
ColPali: Efficient Document Retrieval with Vision Language Models

Paper • 2407.01449 • Published Jun 27, 2024 • 51
vidore/colqwen2.5-v0.2

Visual Document Retrieval • Updated Jun 16, 2025 • 27.4k • 96
vidore/colsmolvlm-v0.1

Visual Document Retrieval • Updated Mar 14, 2025 • 136 • 55
Qwen/Qwen2.5-VL-32B-Instruct

Image-Text-to-Text • Updated Apr 14, 2025 • 184k • 476
🏃 Qwen2.5 VL 32B Instruct Demo

Space • Running • 164 • Chat with a multimodal AI using text, images, or video
Vision-CAIR/LongVU_Qwen2_7B

Video-Text-to-Text • 8B • Updated Feb 28, 2025 • 141 • 74
🌖 LongVU

Space • Running on Zero • 87 • Generate responses to video or image inputs
openbmb/RLAIF-V-Dataset

Preview • Updated Oct 14, 2025 • 814 • 206
HuggingFaceH4/rlaif-v_formatted

Viewer • Updated Jul 2, 2024 • 83.1k • 1.05k • 16
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

Paper • 2404.16006 • Published Apr 24, 2024 • 2
Kaining/MMT-Bench

Viewer • Updated Jun 21, 2024 • 30k • 43 • 10
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

Paper • 2409.02813 • Published Sep 4, 2024 • 33
MMMU/MMMU_Pro

Viewer • Updated Mar 8, 2025 • 5.19k • 6.55k • 45
reducto/RolmOCR

Image-Text-to-Text • Updated Apr 2, 2025 • 2.53k • 581
Alpha-VLLM/Lumina-mGPT-7B-768

Any-to-Any • 7B • Updated Apr 7, 2025 • 6.39k • 38
facebook/chameleon-7b

Image-Text-to-Text • 7B • Updated Jul 23, 2024 • 54.7k • 196
