---
license: apache-2.0
datasets:
- ShaoRun/RS-EoT-4K
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---
# RS-EoT-7B: Remote Sensing Evidence-of-Thought
[**🌐 Project Website**](https://geox-lab.github.io/Asking_like_Socrates/) | [**💻 GitHub Repository**](https://github.com/GeoX-Lab/Asking_like_Socrates) | [**📄 Paper (ArXiv)**](https://arxiv.org/abs/2511.22396) | [**🤗 Dataset (RS-EoT-4K)**](https://huggingface.co/datasets/ShaoRun/RS-EoT-4K)
## 📖 Introduction
**RS-EoT-7B** is a multimodal reasoning model tailored for Remote Sensing (RS) imagery. It introduces the **Evidence-of-Thought (EoT)** paradigm to mitigate the "Glance Effect"—a phenomenon where models hallucinate reasoning without genuinely inspecting visual evidence.
Unlike standard VLMs that rely on a single coarse perception, RS-EoT-7B employs an iterative evidence-seeking mechanism. It has been trained using a two-stage pipeline:
1. **SFT Cold-Start**: Supervised Fine-Tuning on the **RS-EoT-4K** dataset (synthesized via SocraticAgent) to instill the iterative reasoning pattern; a loading sketch for this dataset appears just after this list.
2. **Progressive RL**: Reinforcement Learning on Fine-grained Grounding and General VQA tasks to enhance evidence-seeking capabilities and generalize to broader scenarios.
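The **RS-EoT-4K** cold-start data is hosted on the Hugging Face Hub. Below is a minimal loading sketch using the `datasets` library; the exact configuration, split, and field names are not assumed here, so consult the dataset card for the actual schema.
```python
from datasets import load_dataset

# Minimal sketch: config/split/field names of RS-EoT-4K are not assumed here;
# check the dataset card for the actual schema.
ds = load_dataset("ShaoRun/RS-EoT-4K")
print(ds)                   # available splits and columns
first_split = next(iter(ds))
print(ds[first_split][0])   # peek at one example
```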
## 🛠️ Quick Start
### Installation
Ensure you have recent versions of `transformers` and `qwen-vl-utils` installed; `accelerate` is also needed because the examples below load the model with `device_map="auto"`:
```bash
pip install transformers qwen-vl-utils
pip install accelerate
```
### 1. Visual Question Answering (VQA)
This example demonstrates how to ask the model a question and receive a reasoning-backed answer.
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load model and processor
model_name = "ShaoRun/RS-EoT-7B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Define input image (assumes demo.jpg is in the current directory)
image_path = "./demo.jpg"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": "How many cars are in this image?"},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=4096)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```
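The call above relies on the generation settings shipped with the model. If you want to override them, the standard `transformers` generation arguments apply; the sampling values below are illustrative only, not tuned recommendations for RS-EoT-7B.
```python
# Sampling variant of the generation call above. The parameter values are
# illustrative only, not the model's recommended settings.
generated_ids = model.generate(
    **inputs,
    max_new_tokens=4096,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
```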
### 2. Visual Grounding with Visualization
This example shows how to perform visual grounding and visualize the output bounding boxes.
```python
import re
from PIL import Image, ImageDraw
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info


# --- Helper functions for parsing and visualization ---
def extract_bbox_list_in(text: str) -> list[list[float]]:
    """Extracts bounding boxes from the model output text."""
    boxes = []
    # Unescape characters such as \[ or \" that may appear in the raw output
    text = re.sub(r'\\([{}\[\]":,])', r'\1', text)
    # Pattern to find bracketed lists of numbers like [x1, y1, x2, y2]
    pattern = re.compile(r'\[\s*(.*?)\s*\]', flags=re.IGNORECASE | re.DOTALL)
    matches = pattern.findall(text)
    number_pattern = r'-?\d+\.\d+|-?\d+'
    for match in matches:
        nums = re.findall(number_pattern, match)
        if len(nums) >= 4:
            # Take the first 4 numbers as the box
            boxes.append([float(num) for num in nums[:4]])
    return boxes


def visualize_bboxes(img: Image.Image, boxes: list[list[float]], color=(0, 255, 0), width=3) -> Image.Image:
    """Draws bounding boxes on a copy of the image."""
    out = img.copy()
    draw = ImageDraw.Draw(out)
    W, H = img.size
    for b in boxes:
        if len(b) < 4:
            continue
        x1, y1, x2, y2 = b[:4]
        # Clamp coordinates to the image bounds
        x1, y1 = max(0, min(W - 1, x1)), max(0, min(H - 1, y1))
        x2, y2 = max(0, min(W - 1, x2)), max(0, min(H - 1, y2))
        draw.rectangle([x1, y1, x2, y2], outline=color, width=width)
    return out


# --- Main inference code ---
model_name = "ShaoRun/RS-EoT-7B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Load image
image_path = "./demo.jpg"
image = Image.open(image_path).convert("RGB")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": 'Locate the black car parked on the right in the remote sensing image. Return the coordinates as "[x1, y1, x2, y2]".'},
        ],
    }
]

# Process inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=4096)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f"Model Response:\n{response}")

# Parse and visualize. If the response wraps its reasoning in an explicit delimiter
# (e.g. a closing "</think>" tag), keep only the text after it; otherwise parse the full text.
answer_part = response.split("</think>")[-1] if "</think>" in response else response
detection = extract_bbox_list_in(answer_part)
if detection:
    print(f"Detected BBoxes: {detection}")
    vis_img = visualize_bboxes(image, detection)
    vis_img.save("./res.jpg")
    print("Visualization saved to ./res.jpg")
else:
    print("No bounding boxes detected in the response.")
```
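If the boxes drawn in `res.jpg` look systematically offset or scaled, the model may be emitting coordinates in the processor's resized image space rather than the original pixel space (Qwen2.5-VL processors resize inputs to a patch-aligned resolution). The following is a hedged sketch of rescaling under that assumption, reusing `inputs`, `image`, and `detection` from the script above:
```python
# Assumption: the model's coordinates refer to the processor's resized image.
# image_grid_thw holds (temporal, height, width) counts of 14x14 vision patches.
grid_t, grid_h, grid_w = inputs["image_grid_thw"][0].tolist()
resized_h, resized_w = grid_h * 14, grid_w * 14
orig_w, orig_h = image.size
scaled = [
    [x1 * orig_w / resized_w, y1 * orig_h / resized_h,
     x2 * orig_w / resized_w, y2 * orig_h / resized_h]
    for x1, y1, x2, y2 in detection
]
visualize_bboxes(image, scaled).save("./res_rescaled.jpg")
```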
## 🖊️ Citation
If you use this model in your research, please cite our paper:
```bibtex
@article{shao2025asking,
title={Asking like Socrates: Socrates helps VLMs understand remote sensing images},
author={Shao, Run and Li, Ziyu and Zhang, Zhaoyang and Xu, Linrui and He, Xinran and Yuan, Hongyuan and He, Bolei and Dai, Yongxing and Yan, Yiming and Chen, Yijun and others},
journal={arXiv preprint arXiv:2511.22396},
year={2025}
}
```