---
base_model: Qwen/Qwen2.5-VL-7B-Instruct
library_name: peft
language:
- en
tags:
- vision
- vqa
- medical
- endoscopy
- kvasir
- fine-tuning
- qwen
- visual-question-answering
license: mit
pipeline_tag: visual-question-answering
datasets:
- SimulaMet/Kvasir-VQA-x1
training:
  framework: peft
  method: lora
  base_model: Qwen/Qwen2.5-VL-7B-Instruct
  precision: bfloat16
  dataset: SimulaMet/Kvasir-VQA-x1
  tasks:
  - visual-question-answering
metrics:
- accuracy
- bleu
- rouge
---
# 🧠 Model Card for Kvasir-VQA-x1 Fine-Tuned Models
Fine-tuned vision–language models for Visual Question Answering (VQA) in gastrointestinal (GI) endoscopy, trained on the Kvasir-VQA-x1 benchmark.
## 🧩 Overview
These models extend strong multimodal backbones (Qwen2.5-VL, Qwen2.5-VL-Transf., and MedGemma) using parameter-efficient LoRA fine-tuning on clinically validated image–question–answer pairs from Kvasir-VQA-x1.
They are designed to generate concise, clinically accurate responses to natural-language questions about endoscopic findings, instruments, and anatomical landmarks.
## 🔗 Key Resources
- Dataset: SimulaMet/Kvasir-VQA-x1
- ArXiv: arXiv:2506.09958
- GitHub: Simula/Kvasir-VQA-x1
- Colab Demo: Usage Notebook ▶️
- Published in: Data Engineering in Medical Imaging (DEMI), MICCAI 2025
- Springer Chapter: SpringerLink DOI:10.1007/978-3-032-08009-7_6
## 📊 Model Summary
| Model | Base Model | Hugging Face | Training Logs (W&B) |
|---|---|---|---|
| Qwen2.5-VL-KvasirVQA-x1-ft | Qwen2.5-VL-7B-Instruct | 🔗 SimulaMet/Qwen2.5-VL-KvasirVQA-x1-ft | W&B Run 7mk4gz8s |
| Qwen2.5-VL-Transf-KvasirVQA-x1-ft | Qwen2.5-VL-7B-Transf. | 🔗 SimulaMet/Qwen2.5-VL-Transf-KvasirVQA-x1-ft | W&B Run megwnbz6 |
| MedGemma-KvasirVQA-x1-ft | MedGemma-4B-IT | 🔗 SimulaMet/MedGemma-KvasirVQA-x1-ft | W&B Run 7mk4gz8s |
## ⚙️ Training Configuration
| Attribute | Specification |
|---|---|
| GPUs | 4–8 × A100 (80 GB) |
| Precision | bfloat16 (DeepSpeed ZeRO-2) |
| Frameworks | Transformers + Swift + PEFT |
| Optimizer | Fused AdamW |
| Scheduler | Linear / Cosine (model-specific) |
| Effective Batch Size | 36 (MedGemma) / 32 (Qwen) |
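The adapters were trained with PEFT's LoRA implementation. As a point of reference, the rank/α of 16/64 reported in the next table map onto a configuration along the following lines; this is a minimal sketch, and the target modules and dropout are illustrative assumptions, not the exact values used for these runs.

```python
# Minimal sketch of the LoRA setup, assuming PEFT's LoraConfig.
# r / lora_alpha follow the 16 / 64 reported below; target_modules and
# lora_dropout are illustrative assumptions, not the exact training values.
import torch
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,  # bfloat16 precision, as in the table above
)

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=64,
    lora_dropout=0.05,  # assumed value
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```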
## 🧪 Evaluation Highlights
| Model | Params | Epochs | LR | LoRA (r/α) | Time | Eval Acc. | Eval Loss |
|---|---|---|---|---|---|---|---|
| MedGemma-Transf. | 4.3 B | 4 | 2e-5 | 16 / 64 | 27 h | 84.97 % | 0.4111 |
| Qwen2.5-VL-Transf. | 8.3 B | 4 | 2e-5 | 16 / 64 | 30.9 h | 85.91 % | 0.3883 |
| Qwen2.5-VL | 8.3 B | 3 | 2e-5 | 16 / 64 | 23 h | 85.78 % | 0.3906 |
(Evaluation on a 1 % held-out subset of the training data.)
## 🧮 Evaluation Protocol
Traditional n-gram metrics (BLEU, ROUGE) fail to capture clinical correctness, so these models are evaluated using an LLM-based structured adjudicator (Qwen/Qwen3-30B-A3B). Each model prediction is graded per clinical aspect (polyp_type, instrument_presence, etc.) with binary scores and textual justifications:
```json
{
  "eval_json": {
    "polyp_type": {"score": 1, "reason": "Model correctly identified a sessile polyp."},
    "instrument_presence": {"score": 0, "reason": "Failed to mention visible biopsy forceps."}
  }
}
```
This yields fine-grained, reproducible category-wise accuracy metrics reflecting true clinical reasoning performance. See details in the paper.
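To give a concrete picture of how such adjudicator outputs translate into category-wise accuracy, here is a small sketch that aggregates a list of `eval_json` records; the record shape follows the example above, but the helper itself is illustrative and not part of the released evaluation code.

```python
# Hedged sketch: aggregate binary per-aspect scores from adjudicator outputs
# into category-wise accuracy. The `eval_json` structure mirrors the example
# above; this helper is illustrative, not the official evaluation script.
from collections import defaultdict

def category_accuracy(records):
    """records: iterable of dicts shaped like {"eval_json": {aspect: {"score": 0/1, ...}}}."""
    totals, correct = defaultdict(int), defaultdict(int)
    for rec in records:
        for aspect, verdict in rec["eval_json"].items():
            totals[aspect] += 1
            correct[aspect] += int(verdict["score"])
    return {aspect: correct[aspect] / totals[aspect] for aspect in totals}

# Example with two adjudicated predictions
records = [
    {"eval_json": {"polyp_type": {"score": 1, "reason": "..."},
                   "instrument_presence": {"score": 0, "reason": "..."}}},
    {"eval_json": {"polyp_type": {"score": 1, "reason": "..."}}},
]
print(category_accuracy(records))  # {'polyp_type': 1.0, 'instrument_presence': 0.0}
```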
## 🖼️ Usage Example
```bash
pip install ms-swift==3.8.0 bitsandbytes qwen_vl_utils==0.0.11
```
```python
import torch
from swift.llm import PtEngine, RequestConfig, InferRequest
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization keeps the 7B model within a single mid-range GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# Load the base model and attach the fine-tuned LoRA adapter
engine = PtEngine(
    adapters=["SimulaMet/Qwen2.5-VL-KvasirVQA-x1-ft"],  # or use other fine-tuned model IDs
    model_id_or_path="Qwen/Qwen2.5-VL-7B-Instruct",     # or use other base model IDs
    quantization_config=bnb_config,
    attn_impl="sdpa",
    use_hf=True,
)

req_cfg = RequestConfig(max_tokens=512, temperature=0.3, top_k=20, top_p=0.7, repetition_penalty=1.05)

infer_requests = [
    InferRequest(messages=[{
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/SimulaMet/Kvasir-VQA-x1/resolve/main/images/clb0kvxvm90y4074yf50vf5nq.jpg"},
            {"type": "text", "text": "What is shown in the image?"},
        ],
    }])
]

resp = engine.infer(infer_requests, req_cfg)
print(resp[0].choices[0].message.content)
```
👉 See detailed examples in the Colab usage notebook.
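If you prefer plain Transformers + PEFT over the Swift engine, the LoRA adapter can also be attached directly to the base checkpoint. The following is a minimal sketch, assuming `Qwen2_5_VLForConditionalGeneration`, `AutoProcessor`, and `peft.PeftModel`; prompt construction and generation then follow the standard Qwen2.5-VL chat-template workflow rather than the exact pipeline above.

```python
# Minimal sketch: attach the LoRA adapter with PEFT instead of Swift.
# Assumes a recent transformers release with Qwen2.5-VL support; this is an
# alternative loading path, not the official one from the notebook above.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel

base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "SimulaMet/Qwen2.5-VL-KvasirVQA-x1-ft")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# From here, build inputs with processor.apply_chat_template(...) and call
# model.generate(...) as in the standard Qwen2.5-VL examples.
```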
## 📄 License
See the license terms of the respective base models (Qwen2.5-VL and MedGemma).
## 📢 Citation
If you use these models or the dataset, please cite:
```bibtex
@incollection{Gautam2025Oct,
  author    = {Gautam, Sushant and Riegler, Michael and Halvorsen, P{\aa}l},
  title     = {{Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy}},
  booktitle = {{Data Engineering in Medical Imaging}},
  pages     = {53--63},
  year      = {2025},
  month     = oct,
  isbn      = {978-3-032-08009-7},
  publisher = {Springer},
  address   = {Cham, Switzerland},
  doi       = {10.1007/978-3-032-08009-7_6}
}
```