---
base_model: Qwen/Qwen2.5-VL-7B-Instruct
library_name: peft
language:
- en
tags:
- vision
- vqa
- medical
- endoscopy
- kvasir
- fine-tuning
- qwen
- visual-question-answering
license: mit
pipeline_tag: visual-question-answering
datasets:
- SimulaMet/Kvasir-VQA-x1
training:
  framework: peft
  method: lora
  base_model: Qwen/Qwen2.5-VL-7B-Instruct
  precision: bfloat16
  dataset: SimulaMet/Kvasir-VQA-x1
  tasks:
  - visual-question-answering
metrics:
- accuracy
- bleu
- rouge
---
# 🧠 Model Card for Kvasir-VQA-x1 Fine-Tuned Models
Fine-tuned vision–language models for Visual Question Answering (VQA) in gastrointestinal (GI) endoscopy, trained on the Kvasir-VQA-x1 benchmark.
## 🧩 Overview
These models extend strong multimodal backbones (Qwen2.5-VL, Qwen2.5-VL-Transf., and MedGemma) using parameter-efficient LoRA fine-tuning on clinically validated image–question–answer pairs from Kvasir-VQA-x1.
They are designed to generate concise, clinically accurate responses to natural-language questions about endoscopic findings, instruments, and anatomical landmarks.
## 🔗 Key Resources
- Dataset: SimulaMet/Kvasir-VQA-x1
- ArXiv: arXiv:2506.09958
- GitHub: Simula/Kvasir-VQA-x1
- Colab Demo: Usage Notebook ▶️
- Published in: Data Engineering in Medical Imaging (DEMI), MICCAI 2025
- Springer Chapter: SpringerLink DOI:10.1007/978-3-032-08009-7_6
## 📊 Model Summary
| Model | Base Model | Hugging Face | Training Logs (W&B) |
|---|---|---|---|
| Qwen2.5-VL-KvasirVQA-x1-ft | Qwen2.5-VL-7B-Instruct | 🔗 SimulaMet/Qwen2.5-VL-KvasirVQA-x1-ft | W&B Run 7mk4gz8s |
| Qwen2.5-VL-Transf-KvasirVQA-x1-ft | Qwen2.5-VL-7B-Transf. | 🔗 SimulaMet/Qwen2.5-VL-Transf-KvasirVQA-x1-ft | W&B Run megwnbz6 |
| MedGemma-KvasirVQA-x1-ft | MedGemma-4B-IT | 🔗 SimulaMet/MedGemma-KvasirVQA-x1-ft | W&B Run 7mk4gz8s |
## ⚙️ Training Configuration
| Attribute | Specification |
|---|---|
| GPUs | 4–8 × A100 (80 GB) |
| Precision | bfloat16 (DeepSpeed ZeRO-2) |
| Frameworks | Transformers + Swift + PEFT |
| Optimizer | Fused AdamW |
| Scheduler | Linear / Cosine (model-specific) |
| Effective Batch Size | 36 (MedGemma) / 32 (Qwen) |
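The adapters were trained with PEFT's LoRA implementation. As a point of reference, the rank/α of 16/64 reported in the next table map onto a configuration along the following lines; this is a minimal sketch, and the target modules and dropout are illustrative assumptions, not the exact values used for these runs.

```python
# Minimal sketch of the LoRA setup, assuming PEFT's LoraConfig.
# r / lora_alpha follow the 16 / 64 reported below; target_modules and
# lora_dropout are illustrative assumptions, not the exact training values.
import torch
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,  # bfloat16 precision, as in the table above
)

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=64,
    lora_dropout=0.05,  # assumed value
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```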
## 🧪 Evaluation Highlights
| Model | Params | Epochs | LR | LoRA (r/α) | Time | Eval Acc. | Eval Loss |
|---|---|---|---|---|---|---|---|
| MedGemma-Transf. | 4.3 B | 4 | 2e-5 | 16 / 64 | 27 h | 84.97 % | 0.4111 |
| Qwen2.5-VL-Transf. | 8.3 B | 4 | 2e-5 | 16 / 64 | 30.9 h | 85.91 % | 0.3883 |
| Qwen2.5-VL | 8.3 B | 3 | 2e-5 | 16 / 64 | 23 h | 85.78 % | 0.3906 |
(Evaluation on a 1 % held-out subset of the training data.)
## 🧮 Evaluation Protocol
Traditional n-gram metrics (BLEU, ROUGE) fail to capture clinical correctness, so these models are evaluated using an LLM-based structured adjudicator (Qwen/Qwen3-30B-A3B). Each model prediction is graded per clinical aspect (polyp_type, instrument_presence, etc.) with binary scores and textual justifications:
```json
{
  "eval_json": {
    "polyp_type": {"score": 1, "reason": "Model correctly identified a sessile polyp."},
    "instrument_presence": {"score": 0, "reason": "Failed to mention visible biopsy forceps."}
  }
}
```
This yields fine-grained, reproducible category-wise accuracy metrics reflecting true clinical reasoning performance. See details in the paper.
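To give a concrete picture of how such adjudicator outputs translate into category-wise accuracy, here is a small sketch that aggregates a list of `eval_json` records; the record shape follows the example above, but the helper itself is illustrative and not part of the released evaluation code.

```python
# Hedged sketch: aggregate binary per-aspect scores from adjudicator outputs
# into category-wise accuracy. The `eval_json` structure mirrors the example
# above; this helper is illustrative, not the official evaluation script.
from collections import defaultdict

def category_accuracy(records):
    """records: iterable of dicts shaped like {"eval_json": {aspect: {"score": 0/1, ...}}}."""
    totals, correct = defaultdict(int), defaultdict(int)
    for rec in records:
        for aspect, verdict in rec["eval_json"].items():
            totals[aspect] += 1
            correct[aspect] += int(verdict["score"])
    return {aspect: correct[aspect] / totals[aspect] for aspect in totals}

# Example with two adjudicated predictions
records = [
    {"eval_json": {"polyp_type": {"score": 1, "reason": "..."},
                   "instrument_presence": {"score": 0, "reason": "..."}}},
    {"eval_json": {"polyp_type": {"score": 1, "reason": "..."}}},
]
print(category_accuracy(records))  # {'polyp_type': 1.0, 'instrument_presence': 0.0}
```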
## 🖼️ Usage Example
```bash
pip install ms-swift==3.8.0 bitsandbytes qwen_vl_utils==0.0.11
```
```python
import torch
from swift.llm import PtEngine, RequestConfig, InferRequest
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization keeps the 7B model within a single mid-range GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# Load the base model and attach the fine-tuned LoRA adapter
engine = PtEngine(
    adapters=["SimulaMet/Qwen2.5-VL-KvasirVQA-x1-ft"],  # or use other fine-tuned model IDs
    model_id_or_path="Qwen/Qwen2.5-VL-7B-Instruct",     # or use other base model IDs
    quantization_config=bnb_config,
    attn_impl="sdpa",
    use_hf=True,
)

req_cfg = RequestConfig(max_tokens=512, temperature=0.3, top_k=20, top_p=0.7, repetition_penalty=1.05)

infer_requests = [
    InferRequest(messages=[{
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/SimulaMet/Kvasir-VQA-x1/resolve/main/images/clb0kvxvm90y4074yf50vf5nq.jpg"},
            {"type": "text", "text": "What is shown in the image?"},
        ],
    }])
]

resp = engine.infer(infer_requests, req_cfg)
print(resp[0].choices[0].message.content)
```
👉 See detailed examples in the Colab usage notebook.
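If you prefer plain Transformers + PEFT over the Swift engine, the LoRA adapter can also be attached directly to the base checkpoint. The following is a minimal sketch, assuming `Qwen2_5_VLForConditionalGeneration`, `AutoProcessor`, and `peft.PeftModel`; prompt construction and generation then follow the standard Qwen2.5-VL chat-template workflow rather than the exact pipeline above.

```python
# Minimal sketch: attach the LoRA adapter with PEFT instead of Swift.
# Assumes a recent transformers release with Qwen2.5-VL support; this is an
# alternative loading path, not the official one from the notebook above.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel

base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "SimulaMet/Qwen2.5-VL-KvasirVQA-x1-ft")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# From here, build inputs with processor.apply_chat_template(...) and call
# model.generate(...) as in the standard Qwen2.5-VL examples.
```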
## 📄 License
See the license terms of the respective base models (Qwen2.5-VL and MedGemma).
## 📢 Citation
If you use these models or the dataset, please cite:
```bibtex
@incollection{Gautam2025Oct,
  author    = {Gautam, Sushant and Riegler, Michael and Halvorsen, P{\aa}l},
  title     = {{Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy}},
  booktitle = {{Data Engineering in Medical Imaging}},
  pages     = {53--63},
  year      = {2025},
  month     = oct,
  isbn      = {978-3-032-08009-7},
  publisher = {Springer},
  address   = {Cham, Switzerland},
  doi       = {10.1007/978-3-032-08009-7_6}
}
```