---
license: apache-2.0
datasets:
  - ShaoRun/RS-EoT-4K
base_model:
  - Qwen/Qwen2.5-VL-7B-Instruct
---

# RS-EoT-7B: Remote Sensing Evidence-of-Thought
[**🌐 Project Website**](https://geox-lab.github.io/Asking_like_Socrates/) | [**💻 GitHub Repository**](https://github.com/GeoX-Lab/Asking_like_Socrates) | [**📄 Paper (ArXiv)**](https://arxiv.org/abs/2511.22396) | [**🤗 Dataset (RS-EoT-4K)**](https://huggingface.co/datasets/ShaoRun/RS-EoT-4K)
## 📖 Introduction

**RS-EoT-7B** is a multimodal reasoning model tailored for Remote Sensing (RS) imagery. It introduces the **Evidence-of-Thought (EoT)** paradigm to mitigate the "Glance Effect": a phenomenon where models hallucinate reasoning without genuinely inspecting visual evidence. Unlike standard VLMs that rely on a single coarse perception, RS-EoT-7B employs an iterative evidence-seeking mechanism. It is trained with a two-stage pipeline:

1. **SFT Cold-Start**: Supervised Fine-Tuning on the **RS-EoT-4K** dataset (synthesized via SocraticAgent) to instill the iterative reasoning pattern.
2. **Progressive RL**: Reinforcement Learning on Fine-grained Grounding and General VQA tasks to enhance evidence-seeking capabilities and generalize to broader scenarios.

## 🛠️ Quick Start

### Installation

Ensure you have the latest `transformers` and `qwen-vl-utils` installed:

```bash
pip install transformers
pip install qwen-vl-utils
```

### 1. Visual Question Answering (VQA)

This example demonstrates how to ask the model a question and receive a reasoning-backed answer.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Load model and processor
model_name = "ShaoRun/RS-EoT-7B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Define input image (assumes demo.jpg is in the current directory)
image_path = "./demo.jpg"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": "How many cars are in this image?"},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=4096)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```
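Because the model produces a reasoning-backed answer, you may want to separate the evidence-seeking trace from the final answer. The sketch below is a minimal, hypothetical helper (`split_reasoning_and_answer`) that assumes the reasoning is wrapped in `<think>...</think>` tags, as in many reasoning-tuned VLMs; the exact delimiter depends on the chat template and is an assumption here, not something this card specifies.

```python
import re

def split_reasoning_and_answer(response: str, close_tag: str = "</think>") -> tuple[str, str]:
    """Split a reasoning-backed response into (reasoning, answer).

    NOTE: the "<think>...</think>" delimiter is an assumption; change `close_tag`
    to whatever delimiter the deployed chat template actually emits.
    """
    if close_tag in response:
        reasoning, answer = response.split(close_tag, maxsplit=1)
        # Drop a possible opening tag from the reasoning part.
        reasoning = re.sub(r"^\s*<think>\s*", "", reasoning)
        return reasoning.strip(), answer.strip()
    # No delimiter found: treat the whole response as the answer.
    return "", response.strip()

# Example usage with the VQA output from above:
# reasoning, answer = split_reasoning_and_answer(output_text[0])
# print("Reasoning trace:\n", reasoning)
# print("Final answer:\n", answer)
```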
### 2. Visual Grounding with Visualization

This example shows how to perform visual grounding and visualize the output bounding boxes.

```python
import re
import torch
from PIL import Image, ImageDraw
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info


# --- Helper Functions for Parsing and Visualization ---

def extract_bbox_list_in(text: str) -> list[list[float]]:
    """Extracts bounding boxes from the model output text."""
    boxes = []
    # Unescape characters such as \[ or \" that may appear in the raw output
    text = re.sub(r'\\([{}\[\]":,])', r'\1', text)
    # Pattern to find bracketed lists of numbers like [x1, y1, x2, y2]
    pattern = re.compile(r'\[\s*(.*?)\s*\]', flags=re.IGNORECASE | re.DOTALL)
    matches = pattern.findall(text)
    number_pattern = r'-?\d+\.\d+|-?\d+'
    for match in matches:
        nums = re.findall(number_pattern, match)
        if len(nums) >= 4:
            # Take the first 4 numbers as the box
            box = [float(num) for num in nums[:4]]
            boxes.append(box)
    return boxes


def visualize_bboxes(img: Image.Image, boxes: list[list[float]], color=(0, 255, 0), width=3) -> Image.Image:
    """Draws bounding boxes on the image."""
    out = img.copy()
    draw = ImageDraw.Draw(out)
    W, H = img.size
    for b in boxes:
        if len(b) < 4:
            continue
        x1, y1, x2, y2 = b[:4]
        # Clamp coordinates to the image bounds
        x1, y1 = max(0, min(W - 1, x1)), max(0, min(H - 1, y1))
        x2, y2 = max(0, min(W - 1, x2)), max(0, min(H - 1, y2))
        # Draw rectangle with the given line width
        draw.rectangle([x1, y1, x2, y2], outline=color, width=width)
    return out


# --- Main Inference Code ---

model_name = "ShaoRun/RS-EoT-7B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Load image
image_path = "./demo.jpg"
image = Image.open(image_path).convert('RGB')

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": 'Locate the black car parked on the right in the remote sensing image. Return the coordinates as "[x1, y1, x2, y2]".'},
        ],
    }
]

# Process inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=4096)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f"Model Response:\n{response}")

# Parse and visualize
# NOTE: the "</think>" delimiter below is an assumption; adjust it if the chat template emits a different tag.
answer_part = response.split("</think>")[-1]
detection = extract_bbox_list_in(answer_part)

if detection:
    print(f"Detected BBoxes: {detection}")
    vis_img = visualize_bboxes(image, detection)
    vis_img.save("./res.jpg")
    print("Visualization saved to ./res.jpg")
else:
    print("No bounding boxes detected in the response.")
```

## 🖊️ Citation

If you use this model in your research, please cite our paper:

```bibtex
@article{shao2025asking,
  title={Asking like Socrates: Socrates helps VLMs understand remote sensing images},
  author={Shao, Run and Li, Ziyu and Zhang, Zhaoyang and Xu, Linrui and He, Xinran and Yuan, Hongyuan and He, Bolei and Dai, Yongxing and Yan, Yiming and Chen, Yijun and others},
  journal={arXiv preprint arXiv:2511.22396},
  year={2025}
}
```