Collections
Discover the best community collections!
Collections including paper arxiv:2409.02813
-
Qwen/Qwen2.5-Omni-7B
Any-to-Any • 11B • Updated • 129k • 1.82k
Qwen2.5 Omni 7B Demo
Space • 361 • Generate text and speech from text, audio, images, and videos
Qwen2.5-Omni Technical Report
Paper • 2503.20215 • Published • 166
openbmb/MiniCPM-o-2_6
Any-to-Any • 9B • Updated • 103k • 1.27k
-
iVideoGPT: Interactive VideoGPTs are Scalable World Models
Paper • 2405.15223 • Published • 17
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Paper • 2405.15574 • Published • 55
An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 90
Matryoshka Multimodal Models
Paper • 2405.17430 • Published • 34
-
Meta-Learning a Dynamical Language Model
Paper • 1803.10631 • Published • 1
TLDR: Token Loss Dynamic Reweighting for Reducing Repetitive Utterance Generation
Paper • 2003.11963 • Published
BigScience: A Case Study in the Social Construction of a Multilingual Large Language Model
Paper • 2212.04960 • Published • 1
Continuous Learning in a Hierarchical Multiscale Neural Network
Paper • 1805.05758 • Published • 2
-
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Paper • 2403.09611 • Published • 129
Evolutionary Optimization of Model Merging Recipes
Paper • 2403.13187 • Published • 58
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
Paper • 2402.03766 • Published • 15
LLM Agent Operating System
Paper • 2403.16971 • Published • 72
-
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model
Paper • 2407.07053 • Published • 47
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
Paper • 2407.12772 • Published • 35
VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models
Paper • 2407.11691 • Published • 15
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
Paper • 2408.02718 • Published • 62
-
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
Paper • 2409.02813 • Published • 31
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
Paper • 2404.16006 • Published
LMArena Leaderboard
Space • 4.67k • Display LMArena Leaderboard
-
Multimodal Clembench
Space • 3 • Explore and compare multimodal models with interactive leaderboards and plots
SEED-Bench Leaderboard
Space • 85 • Submit model evaluation results to leaderboard
-
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Paper • 2311.16502 • Published • 37
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
Paper • 2409.02813 • Published • 31
-
RLHF Workflow: From Reward Modeling to Online RLHF
Paper • 2405.07863 • Published • 71
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Paper • 2405.09818 • Published • 131
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Paper • 2405.15574 • Published • 55
An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 90
-
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Paper • 2401.15947 • Published • 53
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Paper • 2404.16821 • Published • 57
Physically Grounded Vision-Language Models for Robotic Manipulation
Paper • 2309.02561 • Published • 9
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
Paper • 2409.02813 • Published • 31
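A listing like the one above can also be pulled programmatically. The snippet below is a minimal sketch using the public huggingface_hub client; the exact "papers/2409.02813" item-filter string and the fields printed are assumptions about that library's API, not anything stated on this page.

```python
# Minimal sketch: list community collections that include the MMMU-Pro paper
# (arXiv 2409.02813) via the public Hugging Face Hub API.
# Assumption: list_collections() accepts an item filter in the "papers/<arxiv-id>"
# form; adjust if the installed huggingface_hub version expects another format.
from huggingface_hub import HfApi

api = HfApi()
collections = api.list_collections(item="papers/2409.02813", limit=20)

for collection in collections:
    # Each result carries the collection title, its slug, and a preview of its items.
    print(collection.title, "->", collection.slug)
    for entry in collection.items:
        # entry.item_type distinguishes models, datasets, Spaces, and papers.
        print("   ", entry.item_type, entry.item_id)
```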