ShaoRun
/

RS-EoT-7B

PyTorch

qwen2_5_vl

ShaoRun committed 19 days ago

Commit fa821e1 (verified) · 1 Parent(s): 95a0232

Update README.md

Files changed (1)

README.md +198 -1

@@ -4,4 +4,201 @@ datasets:
 - ShaoRun/RS-EoT-4K
 base_model:
 - Qwen/Qwen2.5-VL-7B-Instruct
----

+---
+# RS-EoT-7B: Remote Sensing Evidence-of-Thought
+<div align="center">
+[**🌐 Project Website**](coming_soon) | [**💻 GitHub Repository**](coming_soon) | [**📄 Paper (ArXiv)**](coming_soon) | [**🤗 Dataset (RS-EoT-4K)**](https://huggingface.co/datasets/ShaoRun/RS-EoT-4K)
+</div>
+## 📖 Introduction
+**RS-EoT-7B** is a multimodal reasoning model tailored to Remote Sensing (RS) imagery. It introduces the **Evidence-of-Thought (EoT)** paradigm to mitigate the "Glance Effect", a phenomenon in which models hallucinate reasoning without genuinely inspecting the visual evidence.
+Unlike standard VLMs, which rely on a single coarse perception of the image, RS-EoT-7B employs an iterative evidence-seeking mechanism. It is trained with a two-stage pipeline:
+1.  **SFT Cold-Start**: Supervised Fine-Tuning on the **RS-EoT-4K** dataset (synthesized via SocraticAgent) to instill the iterative reasoning pattern.
+2.  **Progressive RL**: Reinforcement Learning on Fine-grained Grounding and General VQA tasks to enhance evidence-seeking capabilities and generalize to broader scenarios.
+## 🛠️ Quick Start
+### Installation
+Ensure the latest `transformers` and `qwen-vl-utils` are installed:
+```bash
+pip install --upgrade transformers qwen-vl-utils
+```
+### 1. Visual Question Answering (VQA)
+This example demonstrates how to ask the model a question and receive a reasoning-backed answer.
+```python
+from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+from qwen_vl_utils import process_vision_info
+import torch
+# Load model and processor
+model_name = "ShaoRun/RS-EoT-7B"
+model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+    model_name, torch_dtype="auto", device_map="auto"
+)
+processor = AutoProcessor.from_pretrained(model_name)
+# Define input image (assumes demo.jpg is in the current directory)
+image_path = "./demo.jpg"
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "image": image_path},
+            {"type": "text", "text": "How many cars are in this image?"},
+        ],
+    }
+]
+# Preparation for inference
+text = processor.apply_chat_template(
+    messages, tokenize=False, add_generation_prompt=True
+)
+image_inputs, video_inputs = process_vision_info(messages)
+inputs = processor(
+    text=[text],
+    images=image_inputs,
+    videos=video_inputs,
+    padding=True,
+    return_tensors="pt",
+)
+# Move inputs to the device the model was placed on by device_map="auto"
+inputs = inputs.to(model.device)
+# Inference
+generated_ids = model.generate(**inputs, max_new_tokens=4096)
+generated_ids_trimmed = [
+    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+]
+output_text = processor.batch_decode(
+    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+)
+print(output_text[0])
+```
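The grounding example in this README recovers the final answer by splitting the response on `</think>`. The same idea can be applied to the VQA output. The following is a minimal sketch, assuming the reasoning is closed by `</think>` and possibly opened by `<think>` (the opening tag is an assumption; the README only shows the closing one). `split_reasoning` is a hypothetical helper, not part of the model's API.

```python
def split_reasoning(response: str, end_tag: str = "</think>") -> tuple[str, str]:
    """Split a response into (reasoning, answer) at the last end_tag.

    If end_tag is absent, the reasoning is empty and the whole
    response is treated as the answer.
    """
    reasoning, sep, answer = response.rpartition(end_tag)
    if not sep:
        return "", response.strip()
    # Assumption: the reasoning may start with an opening <think> tag.
    reasoning = reasoning.replace("<think>", "", 1).strip()
    return reasoning, answer.strip()


# Illustrative string only, not real model output.
example = "<think>I see three cars near the road.</think>\nThere are 3 cars."
reasoning, answer = split_reasoning(example)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

Splitting at the last occurrence of the tag mirrors `response.split("</think>")[-1]` in the grounding example, so both snippets agree on what counts as the answer.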
+### 2. Visual Grounding with Visualization
+This example shows how to perform visual grounding and visualize the output bounding boxes.
+```python
+import re
+import torch
+from PIL import Image, ImageDraw
+from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+from qwen_vl_utils import process_vision_info
+# --- Helper Functions for Parsing and Visualization ---
+def extract_bbox_list_in(text: str) -> list[list[float]]:
+    """Extracts bounding boxes from the model output text."""
+    boxes = []
+    text = re.sub(r'\\([{}\[\]":,])', r'\1', text)
+    # Pattern to find lists of numbers like [x1, y1, x2, y2]
+    pattern = re.compile(r'\[\s*(.*?)\s*\]', flags=re.IGNORECASE | re.DOTALL)
+    matches = pattern.findall(text)
+    number_pattern = r'-?\d+\.\d+|-?\d+'
+    for match in matches:
+        nums = re.findall(number_pattern, match)
+        if len(nums) >= 4:
+            # Take the first 4 numbers as the box
+            box = [float(num) for num in nums[:4]]
+            boxes.append(box)
+    return boxes
+def visualize_bboxes(img: Image.Image, boxes: list[list[float]], color=(0, 255, 0), width=3) -> Image.Image:
+    """Draws bounding boxes on the image."""
+    out = img.copy()
+    draw = ImageDraw.Draw(out)
+    W, H = img.size
+    for b in boxes:
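+        if len(b) < 4:
+            continue
+        x1, y1, x2, y2 = b[:4]
+        # Clamp coordinates to the image bounds
+        x1, y1 = max(0, min(W-1, x1)), max(0, min(H-1, y1))
+        x2, y2 = max(0, min(W-1, x2)), max(0, min(H-1, y2))
+        # Order the corners so x1 <= x2 and y1 <= y2; recent Pillow versions reject inverted boxes
+        x1, x2 = min(x1, x2), max(x1, x2)
+        y1, y2 = min(y1, y2), max(y1, y2)
+        # Draw the rectangle outline with the requested line width
+        draw.rectangle([x1, y1, x2, y2], outline=color, width=width)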
+    return out
+# --- Main Inference Code ---
+model_name = "ShaoRun/RS-EoT-7B"
+model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+    model_name, torch_dtype="auto", device_map="auto"
+)
+processor = AutoProcessor.from_pretrained(model_name)
+# Load Image
+image_path = "./demo.jpg"
+image = Image.open(image_path).convert('RGB')
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "image": image},
+            {"type": "text", "text": 'Locate the black car parked on the right in the remote sensing image. Return the coordinates as "[x1, y1, x2, y2]".'},
+        ],
+    }
+]
+# Process Inputs
+text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+image_inputs, video_inputs = process_vision_info(messages)
+inputs = processor(
+    text=[text],
+    images=image_inputs,
+    videos=video_inputs,
+    padding=True,
+    return_tensors="pt",
+)
+# Move inputs to the device the model was placed on by device_map="auto"
+inputs = inputs.to(model.device)
+# Generate
+generated_ids = model.generate(**inputs, max_new_tokens=4096)
+generated_ids_trimmed = [
+    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+]
+response = processor.batch_decode(
+    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+)[0]
+print(f"Model Response:\n{response}")
+# Parse and Visualize
+answer_part = response.split("</think>")[-1]
+detection = extract_bbox_list_in(answer_part)
+if detection:
+    print(f"Detected BBoxes: {detection}")
+    vis_img = visualize_bboxes(image, detection)
+    vis_img.save("./res.jpg")
+    print("Visualization saved to ./res.jpg")
+else:
+    print("No bounding boxes detected in the response.")
+```
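The script above draws the predicted coordinates directly on the original image. The base Qwen2.5-VL model reports boxes in pixel coordinates of the resized image that the processor feeds to the model, so a rescale may be needed before drawing. Whether RS-EoT-7B follows the same convention is not stated in this README, so treat the sketch below as an assumption to verify on a few images. `rescale_boxes` is a hypothetical helper.

```python
def rescale_boxes(boxes, src_size, dst_size):
    """Map [x1, y1, x2, y2] boxes from one image size to another.

    src_size and dst_size are (width, height) tuples, matching PIL's Image.size.
    """
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    sx, sy = dst_w / src_w, dst_h / src_h
    return [[x1 * sx, y1 * sy, x2 * sx, y2 * sy] for x1, y1, x2, y2 in boxes]


# Assumption: in the grounding script, the resized input size could be read from
# the processor output, e.g. height = inputs["image_grid_thw"][0][1] * 14 and
# width = inputs["image_grid_thw"][0][2] * 14 (14 being the base model's patch size).
scaled = rescale_boxes(
    [[10.0, 20.0, 110.0, 220.0]], src_size=(500, 1000), dst_size=(1000, 2000)
)
print(scaled)
```

If the boxes already line up with the objects when drawn on the original image, no rescaling is needed and this step can be skipped.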
+## 🖊️ Citation
+If you use this model in your research, please cite our paper:
+```bibtex
+coming soon
+```