Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation Paper • 2508.05635 • Published Aug 7 • 73
When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios Paper • 2507.20198 • Published Jul 27 • 26
view article Article DeepSeek-R1 Dissection: Understanding PPO & GRPO Without Any Prior Reinforcement Learning Knowledge Feb 7 • 250
SmolVLM: Redefining small and efficient multimodal models Paper • 2504.05299 • Published Apr 7 • 200
VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models Paper • 2502.02492 • Published Feb 4 • 66
Vision Models (GGUF) Collection How to use: Download a "mmproj" model file + one or more of the primary model files. • 5 items • Updated Dec 22, 2023 • 45
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models Paper • 2404.13013 • Published Apr 19, 2024 • 31
Multimodal Foundation Models: From Specialists to General-Purpose Assistants Paper • 2309.10020 • Published Sep 18, 2023 • 41