---
license: apache-2.0
language:
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
library_name: transformers
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---

# OctoMed-7B

## Introduction

OctoMed-7B is a high-performance multimodal medical reasoning model created through large-scale data curation and supervised fine-tuning (SFT). To support reliable clinical reasoning, we developed a scalable data pipeline that distills structured reasoning traces from DeepSeek-R1 and GPT-4o, producing the largest multimodal medical reasoning dataset to date with more than 8 million traces and 6.8 billion response tokens. Using Qwen2.5-VL-7B-Instruct as the base model, OctoMed-7B is trained on this curated corpus and achieves strong, robust performance on a wide range of out-of-distribution medical benchmarks.

OctoMed-7B produces an internal reasoning trace inside `<think> ... </think>` tokens before writing out its final answer. In general, the model tends to think longer for harder or ill-defined questions, while keeping reasoning traces shorter for easier queries.

## Evaluation

### Medical Benchmark Performances

*Medical benchmark performance chart comparing OctoMed-7B with smaller open-source models and large proprietary models.*

**Notes:**
- Green = smaller open-source models (<10B); Cyan = large proprietary models.
- † = 10-sample majority vote ensemble result.

### Legacy Medical Benchmark Performance

| Dataset | Setting | Performance |
|---------|---------|-------------|
| VQA-RAD | Open (Token F1) | 64.23 |
| VQA-RAD | Closed (Accuracy) | 85.66 |
| SLAKE | Open (Token F1) | 84.96 |
| SLAKE | Closed (Accuracy) | 89.66 |

For these legacy benchmarks, we additionally train on the train splits of the VQA-RAD and SLAKE datasets and report the resulting performance here. We apply a **direct** prompt by appending the phrase **"Answer in a short word or phrase."** to each sample. Following prior work, GPT-2 is used as the tokenizer to compute Token F1 for open-ended questions.

## Requirements

We recommend installing the transformers version used in our experiments, along with the other dependencies, using this command:

```
pip install transformers==4.57.1 accelerate==1.12.0 torchvision==0.24.1 qwen-vl-utils==0.0.14
```

## Quickstart

Below, we provide some examples showing how to use OctoMed-7B with 🤗 Transformers or vLLM.
### Inference with HF Transformers 🤗

Here we show a code snippet demonstrating how to chat with OctoMed-7B using `transformers` and `qwen_vl_utils`:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "OctoMed/OctoMed-7B", dtype=torch.bfloat16, device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "OctoMed/OctoMed-7B",
#     dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
min_pixels = 262144
max_pixels = 262144
processor = AutoProcessor.from_pretrained("OctoMed/OctoMed-7B", min_pixels=min_pixels, max_pixels=max_pixels)

# Text-Only Query
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {"type": "text", "text": "I've had a persistent dry cough for two weeks but no fever. Could this be allergies, and when should I see a doctor?"},
#         ],
#     }
# ]

# General Query
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {
#                 "type": "image",
#                 "image": "https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg",
#             },
#             {"type": "text", "text": "Describe this image."},
#         ],
#     }
# ]

# Multiple Choice Query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg",
            },
            {"type": "text", "text": "What orientation was the MRI in image B taken in?\nA. Axial\nB. Coronal\nC. Sagittal\nD. Oblique\n\nPlease reason step-by-step, and put your final answer within \\boxed{}."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(device="cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=8192)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
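With the multiple-choice prompt above, the decoded text contains the reasoning trace followed by the final answer inside `\boxed{}` (see also the extraction tip under Suggested Hyperparameters below). The following is a minimal sketch of pulling that answer out of `output_text`; the `extract_boxed_answer` helper is illustrative, not part of the released code, and assumes the model followed the `\boxed{}` instruction:

```python
import re

def extract_boxed_answer(text):
    """Return the content of the last \\boxed{...} in the generated text, or None if absent."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

# output_text is the list returned by processor.batch_decode above
print(extract_boxed_answer(output_text[0]))
```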
### Inference with vLLM

Here we show an example of how to use OctoMed-7B with vLLM (tested with vLLM==0.11.2 and transformers==4.57.1):

```python
from vllm import LLM, SamplingParams
from transformers import AutoProcessor

min_pixels = 262144
max_pixels = 262144
processor = AutoProcessor.from_pretrained("OctoMed/OctoMed-7B", min_pixels=min_pixels, max_pixels=max_pixels)

llm = LLM(
    model="OctoMed/OctoMed-7B",
    trust_remote_code=True,
    dtype="bfloat16",
    max_model_len=8192,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.8,
    limit_mm_per_prompt={"image": 1}
)

# Set up sampling parameters
sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=8192,
)

image_data = []

# Text-Only Query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Explain the difference between type 1 and type 2 diabetes."},
        ],
    }
]

# General Query
# image_data = ['https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg']
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {
#                 "type": "image",
#                 "image": image_data[0],
#             },
#             {"type": "text", "text": "Describe this image."},
#         ],
#     }
# ]

# Multiple Choice Query
# image_data = ['https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg']
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {
#                 "type": "image",
#                 "image": image_data[0],
#             },
#             {"type": "text", "text": "What orientation was the MRI in image B taken in?\nA. Axial\nB. Coronal\nC. Sagittal\nD. Oblique\n\nPlease reason step-by-step, and put your final answer within \\boxed{}."},
#         ],
#     }
# ]

prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True)

if image_data:
    mm_prompt = {
        "prompt": prompt,
        "multi_modal_data": {"image": image_data}
    }
else:
    mm_prompt = {"prompt": prompt}

# Generate response
outputs = llm.generate([mm_prompt], sampling_params)

# Print the generated response
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt}")
    print(f"Generated text: {generated_text}")
    print("-" * 50)
```
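Both inference paths above take the multiple-choice question as plain text, with one option per line and the `\boxed{}` instruction appended (the exact template is listed under Suggested Hyperparameters below). A small illustrative helper, not part of the released code, for assembling that string:

```python
def build_mc_prompt(question, options):
    """Assemble a multiple-choice prompt in the format used by the examples above.

    `question` is the question text; `options` is a list of answer strings.
    Option letters A, B, C, ... are prepended automatically. Illustrative only.
    """
    lines = [question]
    lines += [f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)]
    lines.append("")
    lines.append("Please reason step-by-step, and put your final answer within \\boxed{}.")
    return "\n".join(lines)

prompt_text = build_mc_prompt(
    "What orientation was the MRI in image B taken in?",
    ["Axial", "Coronal", "Sagittal", "Oblique"],
)
# Use prompt_text as the "text" entry of the user message in either snippet above.
```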
### Suggested Hyperparameters

To reproduce our results, we suggest using the same settings used in evaluation.

Format multiple-choice questions with the following template:

```
{optional image(s)}
{question}
{options, 1 on each line}
Please reason step-by-step, and put your final answer within \\boxed{}.
```

Example prompt:

```
{image(s)}
What orientation was the MRI in image B taken in?
A: Axial
B: Coronal
C: Sagittal
D: Oblique

Please reason step-by-step, and put your final answer within \\boxed{}.
```

- Use the default system prompt ("You are a helpful assistant.").
- Extract the answer by looking at the content within the last \\boxed{}.
- Temperature of 0.6
- Top-p of 0.95
- min_pixels = 262144
- max_pixels = 262144

### Known Issues

* The model is sensitive to the system prompt. We recommend using the default one.
* The model is finetuned for multiple-choice VQA. It may follow instructions for other tasks, but it has not been extensively tested or post-trained to do so.

We hope to address these concerns in future iterations!

## Citation

If you find our work helpful, feel free to cite us.

```
@article{ossowski2025octomed,
  title={OctoMed: Data Recipes for State-of-the-Art Multimodal Medical Reasoning},
  author={Ossowski, Timothy and Zhang, Sheng and Liu, Qianchu and Qin, Guanghui and Tan, Reuben and Naumann, Tristan and Hu, Junjie and Poon, Hoifung},
  journal={arXiv preprint arXiv:2511.23269},
  year={2025}
}
```