- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 28
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 14
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23
Collections
Collections including paper arxiv:2510.05034
- Intern-S1: A Scientific Multimodal Foundation Model
  Paper • 2508.15763 • Published • 256
- MemMamba: Rethinking Memory Patterns in State Space Model
  Paper • 2510.03279 • Published • 72
- Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
  Paper • 2510.05034 • Published • 46
- QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs
  Paper • 2510.11696 • Published • 174
- InfiR: Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning
  Paper • 2502.11573 • Published • 9
- Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking
  Paper • 2502.02339 • Published • 22
- video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
  Paper • 2502.11775 • Published • 9
- Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
  Paper • 2412.18319 • Published • 39
- facebook/w2v-bert-2.0
  Feature Extraction • 0.6B • Updated • 2.01M • 195
- facebook/metaclip-h14-fullcc2.5b
  Zero-Shot Image Classification • 1.0B • Updated • 24.3k • 46
- openai/clip-vit-large-patch14
  Zero-Shot Image Classification • 0.4B • Updated • 10.5M • 1.9k
- Salesforce/blip-image-captioning-large
  Image-to-Text • 0.5B • Updated • 1M • 1.43k
- Visual Representation Alignment for Multimodal Large Language Models
  Paper • 2509.07979 • Published • 83
- Language Models Can Learn from Verbal Feedback Without Scalar Rewards
  Paper • 2509.22638 • Published • 67
- Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
  Paper • 2510.05034 • Published • 46
- lusxvr/nanoVLM-222M
  Image-Text-to-Text • 0.2B • Updated • 258 • 98
- Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
  Paper • 2503.09516 • Published • 36
- AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time
  Paper • 2505.24863 • Published • 97
- QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning
  Paper • 2505.17667 • Published • 88
- Wolf: Captioning Everything with a World Summarization Framework
  Paper • 2407.18908 • Published • 32
- Mixture of Nested Experts: Adaptive Processing of Visual Tokens
  Paper • 2407.19985 • Published • 37
- TPDiff: Temporal Pyramid Video Diffusion Model
  Paper • 2503.09566 • Published • 45
- DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO
  Paper • 2506.07464 • Published • 14
- GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning
  Paper • 2311.12631 • Published • 15
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
  Paper • 2401.06066 • Published • 56
- VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step
  Paper • 2504.01956 • Published • 41
- UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding
  Paper • 2506.23219 • Published • 7