---
license: apache-2.0
datasets:
- ShaoRun/RS-EoT-4K
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---

# RS-EoT-7B: Remote Sensing Evidence-of-Thought

<div align="center">

[**🌐 Project Website**](https://geox-lab.github.io/Asking_like_Socrates/) | [**💻 GitHub Repository**](https://github.com/GeoX-Lab/Asking_like_Socrates) | [**📄 Paper (ArXiv)**](https://arxiv.org/abs/2511.22396) | [**🤗 Dataset (RS-EoT-4K)**](https://huggingface.co/datasets/ShaoRun/RS-EoT-4K)

</div>

## 📖 Introduction

**RS-EoT-7B** is a multimodal reasoning model tailored for Remote Sensing (RS) imagery. It introduces the **Evidence-of-Thought (EoT)** paradigm to mitigate the "Glance Effect", a phenomenon where models hallucinate reasoning without genuinely inspecting visual evidence.

Unlike standard VLMs that rely on a single coarse perception, RS-EoT-7B employs an iterative evidence-seeking mechanism. It has been trained using a two-stage pipeline:
1.  **SFT Cold-Start**: Supervised Fine-Tuning on the **RS-EoT-4K** dataset (synthesized via SocraticAgent) to instill the iterative reasoning pattern.
2.  **Progressive RL**: Reinforcement Learning on Fine-grained Grounding and General VQA tasks to enhance evidence-seeking capabilities and generalize to broader scenarios.

## 🛠️ Quick Start

### Installation

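Install recent versions of `transformers` and `qwen-vl-utils`, plus `accelerate`, which `device_map="auto"` requires. The examples also assume a working PyTorch installation:

```bash
pip install -U transformers accelerate
pip install -U qwen-vl-utils
```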
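To confirm the environment before loading the model, a quick check along these lines may help. It is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list simply mirrors what the examples in this card import.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names assumed from the imports used in this card.
for name, ver in installed_versions(
    ["torch", "transformers", "accelerate", "qwen-vl-utils", "pillow"]
).items():
    print(f"{name}: {ver if ver is not None else 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` should be installed before running the examples.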

### 1. Visual Question Answering (VQA)

This example demonstrates how to ask the model a question and receive a reasoning-backed answer.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Load model and processor
model_name = "ShaoRun/RS-EoT-7B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Define input image (assumes demo.jpg is in the current directory)
image_path = "./demo.jpg" 

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": "How many cars are in this image?"},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move inputs to the device chosen by device_map="auto" rather than hard-coding "cuda"
inputs = inputs.to(model.device)

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=4096)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(output_text[0])
```
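The grounding example below isolates the final answer by splitting the response on `</think>`. The same idea can be wrapped in a small helper for VQA outputs. This is a minimal sketch, not part of the released code: `split_reasoning` is a hypothetical name, and it assumes the model closes its reasoning trace with a `</think>` tag (optionally opened by `<think>`).

```python
def split_reasoning(response: str, close_tag: str = "</think>") -> tuple[str, str]:
    """Split a response into (reasoning, answer) around the last closing tag."""
    reasoning, sep, answer = response.rpartition(close_tag)
    if not sep:
        # No closing tag found: treat the whole response as the answer.
        return "", response.strip()

    # Derive the opening tag ("<think>") and drop it if the trace starts with it.
    open_tag = close_tag.replace("/", "", 1)
    reasoning = reasoning.strip()
    if reasoning.startswith(open_tag):
        reasoning = reasoning[len(open_tag):].strip()
    return reasoning, answer.strip()


# Made-up response text, for illustration only.
reasoning, answer = split_reasoning("<think>Scan the parking lot row by row.</think>\n12")
print("Reasoning:", reasoning)
print("Answer:", answer)
```

With the VQA script above, this would be called as `split_reasoning(output_text[0])`.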

### 2. Visual Grounding with Visualization

This example shows how to perform visual grounding and visualize the output bounding boxes.

```python
import re
import torch
from PIL import Image, ImageDraw, ImageFont
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# --- Helper Functions for Parsing and Visualization ---

def extract_bbox_list_in(text: str) -> list[list[float]]:
    """Extracts bounding boxes from the model output text."""
    boxes = []
    text = re.sub(r'\\([{}\[\]":,])', r'\1', text)
    # Pattern to find lists of numbers like [x1, y1, x2, y2]
    pattern = re.compile(r'\[\s*(.*?)\s*\]', flags=re.IGNORECASE | re.DOTALL)
    matches = pattern.findall(text)
    
    number_pattern = r'-?\d+\.\d+|-?\d+'
    for match in matches:
        nums = re.findall(number_pattern, match)
        if len(nums) >= 4:
            # Take the first 4 numbers as the box
            box = [float(num) for num in nums[:4]]
            boxes.append(box)
    return boxes

def visualize_bboxes(img: Image.Image, boxes: list[list[float]], color=(0, 255, 0), width=3) -> Image.Image:
    """Draws bounding boxes on the image."""
    out = img.copy()
    draw = ImageDraw.Draw(out)
    W, H = img.size
    
    for b in boxes:
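        if len(b) < 4:
            continue
        x1, y1, x2, y2 = b[:4]

        # Clamp coordinates to the image bounds
        x1, y1 = max(0, min(W - 1, x1)), max(0, min(H - 1, y1))
        x2, y2 = max(0, min(W - 1, x2)), max(0, min(H - 1, y2))
        # Order the corners: recent Pillow versions raise ValueError if x2 < x1 or y2 < y1
        x1, x2 = sorted((x1, x2))
        y1, y2 = sorted((y1, y2))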
        
        # Draw rectangle with thickness
        draw.rectangle([x1, y1, x2, y2], outline=color, width=width)
        
    return out

# --- Main Inference Code ---

model_name = "ShaoRun/RS-EoT-7B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Load Image
image_path = "./demo.jpg"
image = Image.open(image_path).convert('RGB')

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": 'Locate the black car parked on the right in the remote sensing image. Return the coordinates as "[x1, y1, x2, y2]".'},
        ],
    }
]

# Process Inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move inputs to the device chosen by device_map="auto" rather than hard-coding "cuda"
inputs = inputs.to(model.device)

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=4096)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(f"Model Response:\n{response}")

# Parse and Visualize
answer_part = response.split("</think>")[-1] 
detection = extract_bbox_list_in(answer_part)

if detection:
    print(f"Detected BBoxes: {detection}")
    vis_img = visualize_bboxes(image, detection)
    vis_img.save("./res.jpg")
    print("Visualization saved to ./res.jpg")
else:
    print("No bounding boxes detected in the response.")
```
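One caveat when drawing the boxes: the processor may resize the image before it reaches the model, and the base Qwen2.5-VL model reports box coordinates in that resized frame. Whether RS-EoT-7B follows the same convention is an assumption here, not something this card states. If the drawn boxes look offset or scaled, mapping them back to the original resolution may help. The sketch below uses a hypothetical helper, `rescale_boxes`, and made-up numbers.

```python
def rescale_boxes(boxes, src_size, dst_size):
    """Map [x1, y1, x2, y2] boxes from a source (W, H) frame to a destination (W, H) frame."""
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    sx, sy = dst_w / src_w, dst_h / src_h

    scaled = []
    for x1, y1, x2, y2 in boxes:
        # Scale each corner, then sort so that x1 <= x2 and y1 <= y2.
        xa, xb = sorted((x1 * sx, x2 * sx))
        ya, yb = sorted((y1 * sy, y2 * sy))
        scaled.append([xa, ya, xb, yb])
    return scaled


# Illustration only: a box predicted in a 448x448 frame, drawn on an 896x896 original.
print(rescale_boxes([[100, 50, 200, 150]], (448, 448), (896, 896)))
# → [[200.0, 100.0, 400.0, 300.0]]
```

In the script above, the resized frame could be recovered from `inputs["image_grid_thw"]` (grid height and width multiplied by the 14-pixel patch size used by Qwen2.5-VL, again an assumption carried over from the base model) and passed as `src_size`, with `image.size` as `dst_size`, before calling `visualize_bboxes`.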

## 🖊️ Citation

If you use this model in your research, please cite our paper:

```bibtex
@article{shao2025asking,
  title={Asking like Socrates: Socrates helps VLMs understand remote sensing images},
  author={Shao, Run and Li, Ziyu and Zhang, Zhaoyang and Xu, Linrui and He, Xinran and Yuan, Hongyuan and He, Bolei and Dai, Yongxing and Yan, Yiming and Chen, Yijun and others},
  journal={arXiv preprint arXiv:2511.22396},
  year={2025}
}
```