- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 29
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 14
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23
Collections
Collections including paper arxiv:2510.08673
- A Survey of Data Agents: Emerging Paradigm or Overstated Hype?
  Paper • 2510.23587 • Published • 65
- D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI
  Paper • 2510.05684 • Published • 139
- Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation
  Paper • 2510.08673 • Published • 122
- Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds
  Paper • 2511.08892 • Published • 177

- A Survey of Reinforcement Learning for Large Reasoning Models
  Paper • 2509.08827 • Published • 188
- Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference
  Paper • 2508.02193 • Published • 130
- Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations
  Paper • 2510.23607 • Published • 172
- Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation
  Paper • 2510.08673 • Published • 122

- Language Models Can Learn from Verbal Feedback Without Scalar Rewards
  Paper • 2509.22638 • Published • 67
- Don't Just Fine-tune the Agent, Tune the Environment
  Paper • 2510.10197 • Published • 28
- Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation
  Paper • 2510.08673 • Published • 122
- Agent Learning via Early Experience
  Paper • 2510.08558 • Published • 264

- DocLLM: A layout-aware generative language model for multimodal document understanding
  Paper • 2401.00908 • Published • 189
- COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
  Paper • 2401.00849 • Published • 17
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  Paper • 2311.05437 • Published • 51
- LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
  Paper • 2311.00571 • Published • 43

- Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation
  Paper • 2510.08673 • Published • 122
- KangLiao/Puffin
  Text-to-3D • Updated • 22
- Puffin
  👀 25 • Generate images from scene prompts with camera parameters
- KangLiao/Puffin-4M
  Viewer • Updated • 3.18B • 6.14k • 28

-
HoloScene: Simulation-Ready Interactive 3D Worlds from a Single Video
Paper • 2510.05560 • Published • 7 -
TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning
Paper • 2510.06217 • Published • 63 -
Less is More: Recursive Reasoning with Tiny Networks
Paper • 2510.04871 • Published • 485 -
Fast-dLLM v2: Efficient Block-Diffusion LLM
Paper • 2509.26328 • Published • 53
- MeshCoder: LLM-Powered Structured Mesh Code Generation from Point Clouds
  Paper • 2508.14879 • Published • 67
- VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space
  Paper • 2508.19247 • Published • 41
- Pixie: Fast and Generalizable Supervised Learning of 3D Physics from Pixels
  Paper • 2508.17437 • Published • 37
- Multi-View 3D Point Tracking
  Paper • 2508.21060 • Published • 22