Qwen3-VL JEPA World Model
This is a multimodal world model architecture based on the Joint-Embedding Predictive Architecture (JEPA). It combines Qwen3-VL-4B-Thinking, which supplies the reasoning, with the latent space of the Stable Diffusion VAE, which supplies the visual representation.
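JEPA-style models learn to predict the embedding of a target from the embedding of a context, and they compute the loss in latent space rather than pixel space. The sketch below shows that objective with plain Python lists and a hypothetical linear predictor. It is not this repository's training code. The names `predict`, `jepa_loss`, and `sgd_step` are illustrative assumptions, and a real setup would use neural encoders and batched tensors.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def predict(weights: Matrix, context: Vector) -> Vector:
    """Hypothetical linear predictor: maps a context latent to a predicted target latent."""
    return [sum(w * c for w, c in zip(row, context)) for row in weights]


def jepa_loss(predicted: Vector, target: Vector) -> float:
    """Mean squared error measured in latent space, not pixel space."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def sgd_step(weights: Matrix, context: Vector, target: Vector, lr: float) -> Matrix:
    """One gradient-descent step that updates only the predictor.

    The target latent is treated as a constant (a stop-gradient), as
    JEPA-style setups typically do for the target branch.
    """
    predicted = predict(weights, context)
    n = len(target)
    # d(loss)/d(w_ij) = 2/n * (pred_i - target_i) * context_j
    return [
        [w - lr * (2.0 / n) * (predicted[i] - target[i]) * context[j]
         for j, w in enumerate(row)]
        for i, row in enumerate(weights)
    ]


if __name__ == "__main__":
    context = [1.0, 0.5]   # stand-in for an encoded context frame
    target = [0.2, -0.4]   # stand-in for the encoded next frame
    # Zero-initialized here for reproducibility; the repository's predictors start random.
    weights = [[0.0, 0.0], [0.0, 0.0]]
    before = jepa_loss(predict(weights, context), target)
    for _ in range(200):
        weights = sgd_step(weights, context, target, lr=0.1)
    after = jepa_loss(predict(weights, context), target)
    print(f"loss before: {before:.4f}, after: {after:.2e}")
```

Because the loss compares latents rather than pixels, the predictor is never asked to reproduce pixel-level detail, which is the core design choice behind JEPA.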
Architecture
- Thinking Engine: Qwen/Qwen3-VL-4B-Thinking
- Visual Perception: runwayml/stable-diffusion-v1-5 (VAE)
- World Modeling: designed to predict the next latent state of a scene
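If the "latent state" the world model predicts is the SD v1.5 VAE latent (an assumption; the card does not specify the tensor layout), its size follows from the VAE's configuration: 4 latent channels and 8x spatial downsampling. The helper `latent_shape` below is a hypothetical illustration of that arithmetic, not part of this repository.

```python
from math import prod
from typing import Tuple

# Configuration of the Stable Diffusion v1.5 VAE (AutoencoderKL).
LATENT_CHANNELS = 4       # channels in each latent "image"
DOWNSAMPLE_FACTOR = 8     # pixels per latent cell along each spatial axis
# In diffusers, encoder samples are multiplied by vae.config.scaling_factor
# (0.18215 for SD v1.5) and divided by it again before vae.decode().
SCALING_FACTOR = 0.18215


def latent_shape(height: int, width: int) -> Tuple[int, int, int]:
    """Return (channels, height, width) of the VAE latent for one RGB frame."""
    # Require sizes divisible by the downsample factor so the arithmetic is exact.
    if height % DOWNSAMPLE_FACTOR or width % DOWNSAMPLE_FACTOR:
        raise ValueError("height and width must be multiples of 8")
    return (LATENT_CHANNELS, height // DOWNSAMPLE_FACTOR, width // DOWNSAMPLE_FACTOR)


if __name__ == "__main__":
    pixels = (3, 512, 512)
    latents = latent_shape(512, 512)
    print(latents)                        # (4, 64, 64)
    print(prod(pixels) // prod(latents))  # 48
```

A 512x512 frame therefore becomes a 4x64x64 latent with 48 times fewer values than its RGB pixels, which is one reason predicting the next state in this space is attractive.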
Status
This repository contains only the structural fusion of the two backbones. The predictors are currently randomly initialized and must be trained on sequential image data before the model can function as a world model.
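Training on sequential image data amounts to turning each ordered clip into (context, next-frame) pairs. A minimal sketch, assuming a fixed-length context window: `make_next_frame_pairs` is a hypothetical helper, and in practice each frame would presumably be encoded to a VAE latent before pairing. Strings stand in for frames here.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def make_next_frame_pairs(frames: Sequence[T], context_len: int) -> List[Tuple[List[T], T]]:
    """Split an ordered clip into (context window, next frame) training pairs."""
    if context_len < 1:
        raise ValueError("context_len must be at least 1")
    pairs = []
    for t in range(context_len, len(frames)):
        context = list(frames[t - context_len:t])  # the frames just before step t
        target = frames[t]                          # the frame the predictor should anticipate
        pairs.append((context, target))
    return pairs


if __name__ == "__main__":
    print(make_next_frame_pairs(["f0", "f1", "f2", "f3"], context_len=2))
    # [(['f0', 'f1'], 'f2'), (['f1', 'f2'], 'f3')]
```

Clips shorter than the context window yield no pairs, so very short sequences contribute nothing to training under this scheme.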
Model tree for burnboom/Qwen3_world_model_test: the base model is Qwen/Qwen3-VL-4B-Thinking.