Models
Datasets
Spaces
Docs
Enterprise
Pricing
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2405.15223

Perception and abstraction. Each modality is tokenized and embedded into vectors for model to comprehend.

VILA^2: VILA Augmented VILA

Paper • 2407.17453 • Published Jul 24, 2024 • 41
Octopus v4: Graph of language models

Paper • 2404.19296 • Published Apr 30, 2024 • 118
Octo-planner: On-device Language Model for Planner-Action Agents

Paper • 2406.18082 • Published Jun 26, 2024 • 48
Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models

Paper • 2408.15518 • Published Aug 28, 2024 • 42

iVideoGPT: Interactive VideoGPTs are Scalable World Models

Paper • 2405.15223 • Published May 24, 2024 • 17
thuml/ivideogpt-oxe-64-act-free

Updated Dec 17, 2024 • 2
thuml/ivideogpt-oxe-64-goal-cond

Updated Dec 17, 2024
thuml/ivideogpt-oxe-64-act-free-medium

Updated Dec 17, 2024

How Far Are We from Intelligent Visual Deductive Reasoning?

Paper • 2403.04732 • Published Mar 7, 2024 • 23
MoAI: Mixture of All Intelligence for Large Language and Vision Models

Paper • 2403.07508 • Published Mar 12, 2024 • 77
DragAnything: Motion Control for Anything using Entity Representation

Paper • 2403.07420 • Published Mar 12, 2024 • 15
Learning and Leveraging World Models in Visual Representation Learning

Paper • 2403.00504 • Published Mar 1, 2024 • 33

iVideoGPT: Interactive VideoGPTs are Scalable World Models

Paper • 2405.15223 • Published May 24, 2024 • 17
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models

Paper • 2405.15574 • Published May 24, 2024 • 55
An Introduction to Vision-Language Modeling

Paper • 2405.17247 • Published May 27, 2024 • 90
Matryoshka Multimodal Models

Paper • 2405.17430 • Published May 27, 2024 • 34

video llm works

PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

Paper • 2404.16994 • Published Apr 25, 2024 • 36
VideoMamba: State Space Model for Efficient Video Understanding

Paper • 2403.06977 • Published Mar 11, 2024 • 30
VideoAgent: Long-form Video Understanding with Large Language Model as Agent

Paper • 2403.10517 • Published Mar 15, 2024 • 37
Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding

Paper • 2403.09626 • Published Mar 14, 2024 • 16

Video as the New Language for Real-World Decision Making

Paper • 2402.17139 • Published Feb 27, 2024 • 21
Learning and Leveraging World Models in Visual Representation Learning

Paper • 2403.00504 • Published Mar 1, 2024 • 33
MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies

Paper • 2403.01422 • Published Mar 3, 2024 • 29
VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models

Paper • 2403.05438 • Published Mar 8, 2024 • 21

Perception and abstraction. Each modality is tokenized and embedded into vectors for model to comprehend.

VILA^2: VILA Augmented VILA

Paper • 2407.17453 • Published Jul 24, 2024 • 41
Octopus v4: Graph of language models

Paper • 2404.19296 • Published Apr 30, 2024 • 118
Octo-planner: On-device Language Model for Planner-Action Agents

Paper • 2406.18082 • Published Jun 26, 2024 • 48
Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models

Paper • 2408.15518 • Published Aug 28, 2024 • 42

iVideoGPT: Interactive VideoGPTs are Scalable World Models

Paper • 2405.15223 • Published May 24, 2024 • 17
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models

Paper • 2405.15574 • Published May 24, 2024 • 55
An Introduction to Vision-Language Modeling

Paper • 2405.17247 • Published May 27, 2024 • 90
Matryoshka Multimodal Models

Paper • 2405.17430 • Published May 27, 2024 • 34

iVideoGPT: Interactive VideoGPTs are Scalable World Models

Paper • 2405.15223 • Published May 24, 2024 • 17
thuml/ivideogpt-oxe-64-act-free

Updated Dec 17, 2024 • 2
thuml/ivideogpt-oxe-64-goal-cond

Updated Dec 17, 2024
thuml/ivideogpt-oxe-64-act-free-medium

Updated Dec 17, 2024

video llm works

PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

Paper • 2404.16994 • Published Apr 25, 2024 • 36
VideoMamba: State Space Model for Efficient Video Understanding

Paper • 2403.06977 • Published Mar 11, 2024 • 30
VideoAgent: Long-form Video Understanding with Large Language Model as Agent

Paper • 2403.10517 • Published Mar 15, 2024 • 37
Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding

Paper • 2403.09626 • Published Mar 14, 2024 • 16

How Far Are We from Intelligent Visual Deductive Reasoning?

Paper • 2403.04732 • Published Mar 7, 2024 • 23
MoAI: Mixture of All Intelligence for Large Language and Vision Models

Paper • 2403.07508 • Published Mar 12, 2024 • 77
DragAnything: Motion Control for Anything using Entity Representation

Paper • 2403.07420 • Published Mar 12, 2024 • 15
Learning and Leveraging World Models in Visual Representation Learning

Paper • 2403.00504 • Published Mar 1, 2024 • 33

Video as the New Language for Real-World Decision Making

Paper • 2402.17139 • Published Feb 27, 2024 • 21
Learning and Leveraging World Models in Visual Representation Learning

Paper • 2403.00504 • Published Mar 1, 2024 • 33
MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies

Paper • 2403.01422 • Published Mar 3, 2024 • 29
VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models

Paper • 2403.05438 • Published Mar 8, 2024 • 21

Company

TOS Privacy About Jobs

Website

Models Datasets Spaces Pricing Docs