# Visual Spatial Tuning: VST-7B-SFT
This model is described in the paper [Visual Spatial Tuning](https://arxiv.org/abs/2511.05491).

**TL;DR:** VST is a comprehensive framework for cultivating Vision-Language Models (VLMs) with human-like visuospatial abilities, from spatial perception to advanced spatial reasoning.
## Key Highlights
- **VST-P:** 4.1M samples across 19 skills, spanning single images, multi-image scenarios, and videos, boosting spatial perception in VLMs.
- **VST-R:** 135K curated samples that teach models to reason in space, including step-by-step reasoning traces and rule-based data for reinforcement learning.
- **Progressive Training Pipeline:** Supervised fine-tuning first builds foundational spatial knowledge; reinforcement learning then strengthens spatial reasoning. VST achieves state-of-the-art results on spatial benchmarks (34.8% on MMSI-Bench, 61.2% on VSI-Bench) without compromising general capabilities.
- **Vision-Language-Action Models Enhanced:** The spatial tuning paradigm also strengthens vision-language-action models, paving the way for more physically grounded AI.
## Spatial & General Benchmarks
| Models | CV-Bench | 3DSRBench | MMSI-Bench | BLINK | VSI-Bench | MMStar | MMBench | RealWorldQA | MMMU | OCRBench | AI2D |
|---|---|---|---|---|---|---|---|---|---|---|---|
| VST-3B-SFT | 84.4 | 54.1 | 30.2 | 59.1 | 57.9 | 58.0 | 80.9 | 68.4 | 45.2 | 83.7 | 82.5 |
| VST-3B-RL | 84.2 | 56.5 | 31.3 | 57.2 | 57.7 | 58.9 | 80.5 | 68.5 | 49.8 | 80.9 | 82.4 |
| VST-7B-SFT | 85.5 | 54.6 | 32.0 | 62.1 | 60.6 | 63.1 | 83.3 | 72.2 | 50.6 | 85.5 | 84.9 |
| VST-7B-RL | 86.5 | 60.1 | 34.8 | 62.6 | 61.2 | 63.5 | 83.0 | 68.5 | 49.4 | 86.1 | 83.5 |
## VSI-Bench
| Methods | Avg. | Obj. Count | Abs. Dist. | Obj. Size | Room Size | Rel. Dist. | Rel. Dir. | Route Plan | Appr. Order |
|---|---|---|---|---|---|---|---|---|---|
| VST-3B-SFT | 57.9 | 69.3 | 45.4 | 71.8 | 62.4 | 59.0 | 46.0 | 38.7 | 70.2 |
| VST-3B-RL | 57.7 | 66.6 | 45.0 | 72.8 | 60.9 | 59.9 | 47.6 | 40.7 | 68.3 |
| VST-7B-SFT | 60.6 | 72.0 | 44.4 | 74.3 | 68.3 | 59.7 | 55.8 | 44.9 | 65.2 |
| VST-7B-RL | 61.2 | 71.6 | 43.8 | 75.5 | 69.2 | 60.0 | 55.6 | 44.3 | 69.2 |
## Sample Usage
To get started, install the necessary libraries:

```bash
pip install transformers
# It's highly recommended to use the `[decord]` feature for faster video loading.
pip install qwen-vl-utils[decord]
```
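The chat example below enables `flash_attention_2`. If you keep that setting, FlashAttention 2 must also be installed in your environment (typically via `pip install flash-attn --no-build-isolation`, assuming a recent NVIDIA GPU and a CUDA toolchain); otherwise, simply remove the `attn_implementation` argument to fall back to the default attention implementation.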
## Using 🤗 Transformers to Chat
Here is a code snippet showing how to chat with the model using `transformers` and `qwen_vl_utils`:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "rayruiyang/VST-7B-SFT"

# We recommend enabling flash_attention_2 for better acceleration and memory saving,
# especially in multi-image and video scenarios.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

# Default processor; min_pixels/max_pixels bound the number of visual tokens per image.
processor = AutoProcessor.from_pretrained(
    model_path, min_pixels=256 * 28 * 28, max_pixels=1280 * 28 * 28
)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "http://images.cocodataset.org/train2017/000000039685.jpg",
            },
            {"type": "text", "text": "Consider the real-world 3D locations of the objects. Is the flag directly underneath the airplane?"},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generation of the output (decode only the newly generated tokens)
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```
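Since VST is also evaluated on video benchmarks such as VSI-Bench, the same message format can carry video inputs. Below is a minimal sketch that reuses the `model` and `processor` from the snippet above; the video path, `max_pixels` value, and question are placeholders rather than values from the VST release.

```python
# Minimal video-chat sketch; assumes `model`, `processor`, and `process_vision_info`
# are already defined as in the image example above.
# The local video path, max_pixels, and question below are illustrative placeholders.
video_messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video.mp4",
                "max_pixels": 360 * 420,
            },
            {"type": "text", "text": "How many chairs are in the room, and which one is closest to the door?"},
        ],
    }
]

text = processor.apply_chat_template(
    video_messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(video_messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])
```

For long videos, capping `max_pixels` (or passing an explicit list of frame images instead of a single file) helps keep the visual token count manageable.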
## Citation
If you find our work helpful, please consider citing it:

```bibtex
@article{vst,
  title={Visual Spatial Tuning},
  author={Rui Yang and Ziyu Zhu and Yanwei Li and Jingjia Huang and Shen Yan and Siyuan Zhou and Zhe Liu and Xiangtai Li and Shuangye Li and Wenqian Wang and Yi Lin and Hengshuang Zhao},
  journal={arXiv preprint arXiv:2511.05491},
  year={2025}
}
```