Collections
Collections including paper arxiv:2512.21218
- TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times
  Paper • 2512.16093 • Published • 95
- Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
  Paper • 2511.22699 • Published • 237
- DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI
  Paper • 2512.16676 • Published • 218
- Sharp Monocular View Synthesis in Less Than a Second
  Paper • 2512.10685 • Published • 28

- Perception-Aware Policy Optimization for Multimodal Reasoning
  Paper • 2507.06448 • Published • 48
- High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning
  Paper • 2507.05920 • Published • 12
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
  Paper • 2508.18265 • Published • 214
- Latent Chain-of-Thought for Visual Reasoning
  Paper • 2510.23925 • Published • 10

- InfiR : Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning
  Paper • 2502.11573 • Published • 9
- Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking
  Paper • 2502.02339 • Published • 23
- video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
  Paper • 2502.11775 • Published • 9
- Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
  Paper • 2412.18319 • Published • 39

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 29
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 15
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23

- OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
  Paper • 2511.16334 • Published • 93
- Parallel-R1: Towards Parallel Thinking via Reinforcement Learning
  Paper • 2509.07980 • Published • 103
- ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute
  Paper • 2509.04475 • Published • 3
- Stabilizing Reinforcement Learning with LLMs: Formulation and Practices
  Paper • 2512.01374 • Published • 105

- Nuclear Norm Regularization for Deep Learning
  Paper • 2405.14544 • Published • 1
- Token embeddings violate the manifold hypothesis
  Paper • 2504.01002 • Published • 1
- Approximate Nullspace Augmented Finetuning for Robust Vision Transformers
  Paper • 2403.10476 • Published • 1
- ElaLoRA: Elastic & Learnable Low-Rank Adaptation for Efficient Model Fine-Tuning
  Paper • 2504.00254 • Published • 1

- UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces
  Paper • 2312.15715 • Published • 20
- Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
  Paper • 2505.23747 • Published • 69
- VideoPrism: A Foundational Visual Encoder for Video Understanding
  Paper • 2402.13217 • Published • 38
- Scaling RL to Long Videos
  Paper • 2507.07966 • Published • 160