Qwen3-VL JEPA World Model

This is a multimodal world model architecture based on the Joint-Embedding Predictive Architecture (JEPA). It combines Qwen3-VL-4B-Thinking, which provides the reasoning, with the latent space of the Stable Diffusion v1.5 VAE, which provides the visual representation.

🧠 Architecture

  • Thinking Engine: Qwen/Qwen3-VL-4B-Thinking
  • Visual Perception: runwayml/stable-diffusion-v1-5 (VAE)
  • World Modeling: Designed to predict the next latent state of a scene.
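To make the "predict the next latent state" idea concrete, here is a minimal, dependency-free sketch. It is not the code in this repository. The `LinearPredictor` class, its `context` input (standing in for a hidden state from the thinking engine), and the toy dimensions are hypothetical. The only figures taken from outside this card are the Stable Diffusion v1.5 VAE's 4 latent channels and 8x spatial downsampling.

```python
import math
import random

LATENT_CHANNELS = 4   # latent channels of the SD v1.5 VAE
VAE_DOWNSCALE = 8     # spatial downsampling factor of the SD v1.5 VAE


def latent_shape(height, width):
    """Return the (C, H/8, W/8) latent shape for an image of the given size."""
    if height % VAE_DOWNSCALE or width % VAE_DOWNSCALE:
        raise ValueError("image sides must be multiples of 8")
    return (LATENT_CHANNELS, height // VAE_DOWNSCALE, width // VAE_DOWNSCALE)


class LinearPredictor:
    """Hypothetical predictor: maps [context ; z_t] to a predicted z_{t+1}."""

    def __init__(self, context_dim, latent_dim, seed=0):
        rng = random.Random(seed)
        in_dim = context_dim + latent_dim
        scale = 1.0 / math.sqrt(in_dim)
        # Random weights, mirroring an untrained predictor.
        self.weights = [
            [rng.gauss(0.0, scale) for _ in range(in_dim)]
            for _ in range(latent_dim)
        ]

    def predict(self, context, z_t):
        """Predict the next flattened latent from a context vector and z_t."""
        x = list(context) + list(z_t)
        return [sum(w * v for w, v in zip(row, x)) for row in self.weights]


# Toy usage with tiny, made-up dimensions.
predictor = LinearPredictor(context_dim=3, latent_dim=2)
z_next_pred = predictor.predict(context=[0.1, 0.2, 0.3], z_t=[1.0, -1.0])
```

The prediction lives entirely in latent space, which is the point of a JEPA-style design: the model is scored on how well it anticipates the next embedding, not on reconstructing pixels.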

🛠 Status

This repository contains only the structural fusion of the two components. The predictors are currently randomly initialized and must be trained on sequential image data before the model can function as a world model.
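As a rough illustration of what training on sequential image data could look like, the sketch below pairs each frame latent with its successor and takes one gradient step on the mean squared error between the predicted and the actual next latent. Everything here is an assumption made for illustration: the function names, the linear predictor, the plain SGD update, and the toy 2-dimensional latents. A real run would use latents produced by the VAE and a far larger predictor.

```python
def make_pairs(latents):
    """Pair each latent with its successor: (z_t, z_{t+1})."""
    return list(zip(latents[:-1], latents[1:]))


def predict(weights, z_t):
    """Linear prediction of the next latent from the current one."""
    return [sum(w * v for w, v in zip(row, z_t)) for row in weights]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def sgd_step(weights, z_t, z_next, lr=0.1):
    """Update weights in place with one SGD step; return the pre-update loss."""
    pred = predict(weights, z_t)
    n = len(z_next)
    for i, row in enumerate(weights):
        # d(MSE)/d(pred_i) = 2 * (pred_i - target_i) / n
        grad_out = 2.0 * (pred[i] - z_next[i]) / n
        for j in range(len(row)):
            row[j] -= lr * grad_out * z_t[j]
    return mse(pred, z_next)


# Toy sequence of three flattened "frame latents".
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pairs = make_pairs(sequence)
weights = [[0.0, 0.0], [0.0, 0.0]]  # stands in for an untrained predictor

z_t, z_next = pairs[0]
loss_before = sgd_step(weights, z_t, z_next)
loss_after = mse(predict(weights, z_t), z_next)
```

In this sketch the targets are fixed latents and only the predictor is updated, which corresponds to keeping the visual encoder frozen. Whether this repository intends that setup is not stated here.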
