---
license: apache-2.0
language:
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
library_name: transformers
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---



# <img src="assets/OctoMed.svg" alt="OctoMed Logo" width="100" style="vertical-align:bottom; margin-right:0px;" /> OctoMed-7B

## Introduction

OctoMed-7B is a high-performance multimodal medical reasoning model built through large-scale data curation and supervised fine-tuning (SFT). To support reliable clinical reasoning, we developed a scalable data pipeline that distills structured reasoning traces from DeepSeek-R1 and GPT-4o, producing the largest multimodal medical reasoning dataset to date: more than 8 million traces and 6.8 billion response tokens.

Using Qwen2.5-VL-7B-Instruct as the base model, OctoMed-7B is trained on this curated corpus and achieves strong, robust performance on a wide range of out-of-distribution medical benchmarks.

OctoMed-7B emits an internal reasoning trace inside \<think>...\</think> tags before writing its final answer. In general, the model tends to think longer for harder or ill-defined questions and keeps its reasoning traces shorter for easier queries.
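
If you want to separate the reasoning trace from the final answer programmatically, a minimal sketch (assuming the \<think>...\</think> tags appear as literal text in the decoded response) could look like this:

```python
def split_reasoning(output_text: str) -> tuple[str, str]:
    """Split a model response into (reasoning_trace, final_answer).

    Assumes the reasoning is wrapped in literal <think>...</think> tags;
    if no closing tag is found, the whole response is treated as the answer.
    """
    tag = "</think>"
    if tag in output_text:
        reasoning, answer = output_text.split(tag, 1)
        return reasoning.replace("<think>", "").strip(), answer.strip()
    return "", output_text.strip()
```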

## Evaluation

### Medical Benchmark Performances

<p align="center">
    <img src="assets/performances.svg" alt="Medical Benchmark Performances" width="100%" />
</p>

**Notes:**  
- Green = smaller open-source models (<10B); cyan = large proprietary models.  
- † = 10-sample majority-vote ensemble result.

### Legacy Medical Benchmark Performance

| Dataset  | Setting | Performance |
|----------|---------|--------------|
| VQA-RAD  | Open (Token F1)    | 64.23        |
| VQA-RAD  | Closed (Accuracy)  | 85.66        |
| SLAKE    | Open (Token F1)   | 84.96        |
| SLAKE    | Closed (Accuracy) | 89.66        |

We also train on the training splits of the VQA-RAD and SLAKE datasets and report the resulting performance here. For these results, we use a **direct** prompt by appending the phrase **Answer in a short word or phrase.** to the end of each sample. Following prior work, GPT-2 is used as the tokenizer to compute Token F1 for open-ended questions.
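
For reference, a minimal sketch of this Token F1 computation (assuming simple lowercasing before GPT-2 tokenization; the exact normalization used in prior work may differ) looks like:

```python
from collections import Counter
from transformers import AutoTokenizer

# GPT-2 tokenizer, as used for the open-ended Token F1 numbers above.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between prediction and reference under GPT-2 tokenization."""
    pred_tokens = tokenizer.tokenize(prediction.lower())
    ref_tokens = tokenizer.tokenize(reference.lower())
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset overlap between predicted and reference tokens.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```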


## Requirements
We recommend installing the transformers version used in our experiments and other dependencies with this command:
```
pip install transformers==4.57.1 accelerate==1.12.0 torchvision==0.24.1 qwen-vl-utils==0.0.14
```

## Quickstart

Below, we provide some examples showing how to use OctoMed-7B with 🤗 Transformers or vLLM.

<details>
<summary>Inference with HF Transformers 🤗</summary>

Here is a code snippet showing how to chat with OctoMed-7B using `transformers` and `qwen_vl_utils`:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "OctoMed/OctoMed-7B", dtype=torch.bfloat16, device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "OctoMed/OctoMed-7B",
#     dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
min_pixels = 262144
max_pixels = 262144
processor = AutoProcessor.from_pretrained("OctoMed/OctoMed-7B", min_pixels=min_pixels, max_pixels=max_pixels)

# Text-Only Query
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {"type": "text", "text": "I've had a persistent dry cough for two weeks but no fever. Could this be allergies, and when should I see a doctor?"},
#         ],
#     }
# ]

# General Query
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {
#                 "type": "image",
#                 "image": "https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg",
#             },
#             {"type": "text", "text": "Describe this image."},
#         ],
#     }
# ]

# Multiple Choice Query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg",
            },
            {"type": "text", "text": "What orientation was the MRI in image B taken in?\nA. Axial\nB. Coronal\nC. Sagittal\nD. Oblique\n\nPlease reason step-by-step, and put your final answer within \\boxed{}."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

inputs = inputs.to(device="cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=8192)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

```
</details>

<details>
<summary>Inference with vLLM</summary>

Here we show an example of how to use OctoMed with vLLM (tested with vLLM==0.11.2 and transformers==4.57.1):

```python
from vllm import LLM, SamplingParams
from transformers import AutoProcessor

min_pixels = 262144
max_pixels = 262144
processor = AutoProcessor.from_pretrained("OctoMed/OctoMed-7B", min_pixels=min_pixels, max_pixels=max_pixels)

llm = LLM(
    model="OctoMed/OctoMed-7B",
    trust_remote_code=True,
    dtype="bfloat16",
    max_model_len=8192,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.8,
    limit_mm_per_prompt={"image": 1}
)

# Set up sampling parameters
sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=8192,
)

image_data = []

# Text-Only Query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Explain the difference between type 1 and type 2 diabetes."},
        ],
    }
]

# General Query
# image_data = ['https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg']
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {
#                 "type": "image",
#                 "image": image_data[0],
#             },
#             {"type": "text", "text": "Describe this image."},
#         ],
#     }
# ]

# Multiple Choice Query
# image_data = ['https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg']
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {
#                 "type": "image",
#                 "image": image_data[0],
#             },
#             {"type": "text", "text": "What orientation was the MRI in image B taken in?\nA. Axial\nB. Coronal\nC. Sagittal\nD. Oblique\n\nPlease reason step-by-step, and put your final answer within \\boxed{}."},
#         ],
#     }
# ]

prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True)

if image_data:
    mm_prompt = {
        "prompt": prompt,
        "multi_modal_data": {"image": image_data}
    }
else:
    mm_prompt = {"prompt": prompt}

# Generate response
outputs = llm.generate([mm_prompt], sampling_params)

# Print the generated response
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt}")
    print(f"Generated text: {generated_text}")
    print("-" * 50)
```
</details>



### Suggested Hyperparameters
We suggest using the same settings we used in evaluation to reproduce our results.

Format multiple-choice questions with the following template:
```
{optional image(s)}
{question}
{options, 1 on each line}

Please reason step-by-step, and put your final answer within \boxed{}.
```

Example Prompt:
```
{image(s)}
What orientation was the MRI in image B taken in?
A. Axial
B. Coronal
C. Sagittal
D. Oblique

Please reason step-by-step, and put your final answer within \boxed{}.
```
- Use the default system prompt ("You are a helpful assistant.")
- Extract the answer from the content of the last `\boxed{}` in the response (see the extraction sketch after this list).
- Temperature of 0.6
- Top-p of 0.95
- min_pixels = 262144
- max_pixels = 262144
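
A minimal answer-extraction sketch following these settings (the regex assumes the content of `\boxed{}` contains no nested braces, which holds for single-letter multiple-choice answers):

```python
import re

def extract_final_answer(response: str) -> str | None:
    """Return the content of the last \\boxed{...} in a model response, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

# Example:
# extract_final_answer("... so the view is axial. \\boxed{A}")  ->  "A"
```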


### Known Issues
* The model is sensitive to the system prompt. We recommend using the default one.
* The model is fine-tuned for multiple-choice VQA. It may follow instructions for other tasks but has not been extensively tested or post-trained to do so.

We hope to address these concerns in future iterations!

## Citation

If you find our work helpful, feel free to cite us.

```bibtex
@article{ossowski2025octomed,
  title={OctoMed: Data Recipes for State-of-the-Art Multimodal Medical Reasoning},
  author={Ossowski, Timothy and Zhang, Sheng and Liu, Qianchu and Qin, Guanghui and Tan, Reuben and Naumann, Tristan and Hu, Junjie and Poon, Hoifung},
  journal={arXiv preprint arXiv:2511.23269},
  year={2025}
}
```