Commit fa821e1 (verified) · ShaoRun committed · 1 Parent(s): 95a0232

Update README.md

Files changed (1): README.md (+198 -1)

README.md CHANGED

datasets:
- ShaoRun/RS-EoT-4K
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---

# RS-EoT-7B: Remote Sensing Evidence-of-Thought

<div align="center">

[**🌐 Project Website**](coming_soon) | [**💻 GitHub Repository**](coming_soon) | [**📄 Paper (ArXiv)**](coming_soon) | [**🤗 Dataset (RS-EoT-4K)**](https://huggingface.co/datasets/ShaoRun/RS-EoT-4K)

</div>

## 📖 Introduction

**RS-EoT-7B** is a multimodal reasoning model tailored for Remote Sensing (RS) imagery. It introduces the **Evidence-of-Thought (EoT)** paradigm to mitigate the "Glance Effect", a failure mode in which models hallucinate reasoning without genuinely inspecting the visual evidence.

Unlike standard VLMs that rely on a single coarse perception pass, RS-EoT-7B employs an iterative evidence-seeking mechanism. It is trained with a two-stage pipeline (a data-loading sketch for the SFT corpus follows the list):
1. **SFT Cold-Start**: Supervised Fine-Tuning on the **RS-EoT-4K** dataset (synthesized via SocraticAgent) to instill the iterative reasoning pattern.
2. **Progressive RL**: Reinforcement Learning on Fine-grained Grounding and General VQA tasks to strengthen evidence-seeking capabilities and generalize to broader scenarios.
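
The SFT cold-start stage uses the linked **RS-EoT-4K** dataset. Below is a minimal sketch for inspecting it with the standard Hugging Face `datasets` API; the split and column names are not documented here, so the code simply prints whatever schema the dataset actually exposes.

```python
from datasets import load_dataset

# Load RS-EoT-4K; the available splits are not specified here, so print the DatasetDict to see them.
ds = load_dataset("ShaoRun/RS-EoT-4K")
print(ds)  # shows splits and column names

# Peek at the first record of the first split to see the reasoning-trace format.
first_split = next(iter(ds.values()))
print(first_split[0].keys())
```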

## 🛠️ Quick Start

### Installation

Ensure you have the latest `transformers` and `qwen-vl-utils` installed:

```bash
pip install transformers
pip install qwen-vl-utils
```
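
As a quick sanity check that both packages import cleanly (a minimal sketch; exact minimum versions are not pinned here, but Qwen2.5-VL support requires a reasonably recent `transformers`):

```python
# Confirm the key imports resolve and report the installed transformers version.
import transformers
import qwen_vl_utils  # only checking that the package is importable

print(transformers.__version__)
```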

### 1. Visual Question Answering (VQA)

This example demonstrates how to ask the model a question and receive a reasoning-backed answer.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load model and processor
model_name = "ShaoRun/RS-EoT-7B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Define input image (assumes demo.jpg is in the current directory)
image_path = "./demo.jpg"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": "How many cars are in this image?"},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=4096)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(output_text[0])
```
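
The decoded output typically contains the model's reasoning trace followed by the final answer; the grounding example below relies on a `</think>` delimiter to separate the two. A minimal post-processing sketch under that same assumption:

```python
# Assumes `output_text` from the snippet above, and that the reasoning trace,
# if present, is closed by a </think> tag (as in the grounding example below).
raw = output_text[0]
if "</think>" in raw:
    reasoning, answer = raw.rsplit("</think>", 1)
    print("Reasoning trace:\n" + reasoning.strip())
    print("Final answer:\n" + answer.strip())
else:
    print("Final answer:\n" + raw.strip())
```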

### 2. Visual Grounding with Visualization

This example shows how to perform visual grounding and visualize the predicted bounding boxes.

```python
import re

from PIL import Image, ImageDraw
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# --- Helper Functions for Parsing and Visualization ---

def extract_bbox_list_in(text: str) -> list[list[float]]:
    """Extract bounding boxes from the model output text."""
    boxes = []
    # Un-escape characters such as \[ and \" that may appear in the raw output
    text = re.sub(r'\\([{}\[\]":,])', r'\1', text)
    # Pattern to find bracketed lists of numbers like [x1, y1, x2, y2]
    pattern = re.compile(r'\[\s*(.*?)\s*\]', flags=re.IGNORECASE | re.DOTALL)
    matches = pattern.findall(text)

    number_pattern = r'-?\d+\.\d+|-?\d+'
    for match in matches:
        nums = re.findall(number_pattern, match)
        if len(nums) >= 4:
            # Take the first 4 numbers as the box
            box = [float(num) for num in nums[:4]]
            boxes.append(box)
    return boxes

def visualize_bboxes(img: Image.Image, boxes: list[list[float]], color=(0, 255, 0), width=3) -> Image.Image:
    """Draw bounding boxes on a copy of the image."""
    out = img.copy()
    draw = ImageDraw.Draw(out)
    W, H = img.size

    for b in boxes:
        if len(b) < 4:
            continue
        x1, y1, x2, y2 = b[:4]

        # Clamp coordinates to the image bounds
        x1, y1 = max(0, min(W - 1, x1)), max(0, min(H - 1, y1))
        x2, y2 = max(0, min(W - 1, x2)), max(0, min(H - 1, y2))

        # Draw rectangle with the given line thickness
        draw.rectangle([x1, y1, x2, y2], outline=color, width=width)

    return out

# --- Main Inference Code ---

model_name = "ShaoRun/RS-EoT-7B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Load image
image_path = "./demo.jpg"
image = Image.open(image_path).convert('RGB')

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": 'Locate the black car parked on the right in the remote sensing image. Return the coordinates as "[x1, y1, x2, y2]".'},
        ],
    }
]

# Process inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=4096)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(f"Model Response:\n{response}")

# Parse the answer part (after the reasoning trace) and visualize
answer_part = response.split("</think>")[-1]
detection = extract_bbox_list_in(answer_part)

if detection:
    print(f"Detected BBoxes: {detection}")
    vis_img = visualize_bboxes(image, detection)
    vis_img.save("./res.jpg")
    print("Visualization saved to ./res.jpg")
else:
    print("No bounding boxes detected in the response.")
```
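
To sanity-check the parser without running the model, you can call `extract_bbox_list_in` on a hand-written string (the response text below is purely illustrative):

```python
# Assumes extract_bbox_list_in from the block above is already defined.
sample = 'The black car on the right is at "[512, 130, 568, 176]".'
print(extract_bbox_list_in(sample))
# Expected output: [[512.0, 130.0, 568.0, 176.0]]
```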

## 🖊️ Citation

If you use this model in your research, please cite our paper:

```bibtex
coming soon
```