Qwen2-VL-2B Fine-tuned on Pagoda Dataset (LoRA)
This model is a fine-tuned version of Qwen/Qwen2-VL-2B-Instruct using LoRA (Low-Rank Adaptation) on the Pagoda text-and-image dataset.
Model Description
This is a multimodal vision-language model that has been fine-tuned to understand and describe images with a focus on pagoda-related content. The model uses LoRA adapters applied to both the vision encoder and language decoder, enabling efficient fine-tuning while maintaining high performance.
Key Features:
- Multimodal Fine-tuning: Both vision and language components were trained
- Efficient Training: 4-bit quantization with LoRA (only ~1% of parameters trained)
- Memory Optimized: Trained with aggressive memory optimizations
- Fast Inference: Maintains the base model's inference speed
- Production Ready: Includes processor and generation configs
Training Details
Hardware
- GPU: 1x NVIDIA H200 (140.4 GB VRAM)
- VRAM Usage: ~6.7 GB / 140.4 GB
- Memory Bandwidth: 4051.9 GB/s
- Compute: 53.5 TFLOPS
- CPU: Intel Xeon Platinum 8568Y+ (96 cores, 12 used)
- RAM: 387.0 GB (3 GB used)
- Storage: Dell Enterprise NVMe PM1735a 3.2TB
- Read Speed: 24985.6 MB/s
- Network: 1250 Mbps (download/upload ~7.3 Gbps)
- Platform: Vast.ai Instance #27922151
- CUDA Version: 12.8
Dataset
- Source: nojiyoon/pagoda-text-and-image-dataset-small
- Samples Used: 1000 samples
- Train/Val Split: 900 / 100 (90% / 10%)
- Image Processing: Resized to max 280×280 px to reduce memory
- Text Processing: Truncated to 50 characters max
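The preprocessing listed above can be sketched in plain Python. This is a minimal illustration rather than the actual training script: the helper names `fit_within`, `truncate_text`, and `split_counts` are hypothetical, and preserving the aspect ratio during resizing is an assumption that this card does not state.

```python
def fit_within(width, height, max_side=280):
    """Return (w, h) scaled down so neither side exceeds max_side.

    Assumes the aspect ratio is preserved; smaller images are left unchanged.
    """
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


def truncate_text(text, max_chars=50):
    """Keep at most max_chars characters of the caption."""
    return text[:max_chars]


def split_counts(n_samples=1000, val_fraction=0.10):
    """Return (train, val) sample counts for a simple holdout split."""
    n_val = int(round(n_samples * val_fraction))
    return n_samples - n_val, n_val


print(fit_within(1120, 840))         # (280, 210)
print(len(truncate_text("x" * 80)))  # 50
print(split_counts())                # (900, 100)
```

Note that a 50-character cap keeps only a short phrase of each caption, which is consistent with the text-truncation limitation listed later in this card.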
Model Architecture
- Base Model: Qwen2-VL-2B-Instruct
- Total Parameters: ~2 Billion
- LoRA Rank: 8
- LoRA Alpha: 16
- LoRA Dropout: 0.05
- Target Modules:
  - Vision: qkv (attention projections in vision transformer)
  - Language: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Trainable Parameters: 15-20 million (0.75-1% of total)
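To see why LoRA trains so few parameters: a rank-r adapter on a linear layer with `d_in` inputs and `d_out` outputs adds two small matrices, for a total of r × (d_in + d_out) trainable weights, and the update is scaled by alpha / r. The sketch below uses a hypothetical 1024-wide square layer purely for illustration. The dimensions are not read from the Qwen2-VL checkpoint, and the function names are made up for this example.

```python
def lora_param_count(d_in, d_out, rank=8):
    """Trainable parameters of one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_scaling(alpha=16, rank=8):
    """Factor applied to the low-rank update B @ A before it is added to the frozen weight."""
    return alpha / rank


# Hypothetical square projection of width 1024 (illustrative only).
adapter = lora_param_count(1024, 1024)
frozen = 1024 * 1024
print(adapter)                     # 16384
print(f"{adapter / frozen:.2%}")   # 1.56%
print(lora_scaling())              # 2.0
```

Summing this count over every targeted module in every layer gives the total trainable parameter count for a given rank.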
Training Hyperparameters
- Quantization: 4-bit NF4 with double quantization
- Compute Dtype: bfloat16
- Epochs: 1
- Batch Size: 1 per device
- Gradient Accumulation: 8 steps
- Effective Batch Size: 8
- Learning Rate: 2e-4
- LR Scheduler: Cosine with warmup
- Warmup Steps: 50
- Optimizer: PagedAdamW 8-bit
- Gradient Clipping: 1.0
- Gradient Checkpointing: Enabled (non-reentrant)
- Mixed Precision: bfloat16
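As a rough picture of the "cosine with warmup" schedule, the sketch below uses the common form of linear warmup followed by a half-cosine decay to zero. The function name `lr_at_step` is hypothetical, the total of 113 steps is taken from the step count reported in this card, and the exact schedule implementation used during training is an assumption.

```python
import math


def lr_at_step(step, total_steps=113, warmup_steps=50, base_lr=2e-4):
    """Linear warmup to base_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for s in (0, 25, 50, 80, 113):
    print(s, f"{lr_at_step(s):.2e}")
```

With 50 warmup steps out of roughly 113, the learning rate is still ramping up for almost half of the run under these assumptions.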
Training Configuration
- Image Resolution: 28×28 to 280×280 pixels (min/max)
- Sequence Length: Dynamic (no max truncation to avoid token mismatch)
- Data Workers: 0 (memory optimization)
- Pin Memory: Disabled
- Training Time: ~15-20 minutes
- Evaluation Strategy: Every 100 steps
- Save Strategy: Every 100 steps (keep only best)
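The resolution bounds translate directly into a visual token budget. The sketch below assumes the Qwen2-VL convention of 14-pixel patches merged 2×2, so that each visual token covers a 28×28 pixel block. The function name `visual_tokens` is hypothetical, and the count ignores the special tokens that wrap the image in the prompt.

```python
PIXELS_PER_TOKEN_SIDE = 28  # assumed: 14 px patch size x 2 (spatial merge)


def visual_tokens(width, height):
    """Approximate visual token count for an image whose sides are multiples of 28."""
    return (width // PIXELS_PER_TOKEN_SIDE) * (height // PIXELS_PER_TOKEN_SIDE)


min_pixels = 28 * 28    # lower bound from the card
max_pixels = 280 * 280  # upper bound from the card

print(min_pixels, max_pixels)      # 784 78400
print(visual_tokens(28, 28))       # 1
print(visual_tokens(280, 280))     # 100
print(visual_tokens(1120, 1120))   # 1600
```

Under these assumptions a training image contributes at most about 100 visual tokens, which is much shorter than the sequence a full-resolution photo would produce.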
Memory Optimizations
- 4-bit quantization reduces weight memory by roughly 75% relative to 16-bit weights
- Small image resolution (280px max) reduces vision tokens
- Aggressive text truncation (50 chars) reduces sequence length
- Gradient checkpointing reduces activation memory
- No data loader workers to save RAM
- Batch size of 1 with gradient accumulation
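The 75% figure for 4-bit quantization follows from simple arithmetic when compared against 16-bit weights. The sketch below is a back-of-the-envelope estimate only: it uses the card's "~2 Billion" parameter count, ignores quantization metadata and any layers kept in higher precision, and does not cover activations, optimizer state, or the LoRA weights themselves.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring quantization metadata."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 2e9  # "~2 Billion" from this card
bf16_gb = weight_memory_gb(n_params, 16)
nf4_gb = weight_memory_gb(n_params, 4)

print(bf16_gb, nf4_gb)       # 4.0 1.0
print(1 - nf4_gb / bf16_gb)  # 0.75
```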
Usage
Installation
pip install transformers accelerate peft pillow torch
# For 4-bit quantization (optional but recommended):
pip install bitsandbytes
Quick Start
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
from PIL import Image
import torch

# Load base model in bfloat16 (for 4-bit loading, pass a BitsAndBytesConfig via quantization_config)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
)

# Load LoRA adapter
model = PeftModel.from_pretrained(model, "ahczhg/qwen3-vl-2b-pagoda-lora")
model.eval()

# Load processor
processor = AutoProcessor.from_pretrained(
    "ahczhg/qwen3-vl-2b-pagoda-lora",
    trust_remote_code=True
)

# Load image and build the chat-style prompt
image = Image.open("your_image.jpg").convert("RGB")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

# Render the prompt text, then tokenize text and image together
text = processor.apply_chat_template(
    conversation,
    tokenize=False,
    add_generation_prompt=True
)
inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt"
).to(model.device)

# Generate response
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )

# Decode only the newly generated tokens, not the echoed prompt
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]
print(response)
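The `from_pretrained` call in Quick Start does not pass a quantization config. A 4-bit load that mirrors the training settings listed in this card (NF4, double quantization, bfloat16 compute) might look like the sketch below. The helper name `load_base_4bit` is hypothetical, and the snippet assumes `bitsandbytes` is installed and a CUDA GPU is available.

```python
import torch
from transformers import BitsAndBytesConfig, Qwen2VLForConditionalGeneration

# NF4 with double quantization and bfloat16 compute, matching the listed training settings.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)


def load_base_4bit(model_id="Qwen/Qwen2-VL-2B-Instruct"):
    """Load the base model with 4-bit weights (hypothetical helper, not from the original card)."""
    return Qwen2VLForConditionalGeneration.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
    )
```

The adapter can then be attached to the returned model with `PeftModel.from_pretrained`, exactly as in Quick Start.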
Advanced: Merge LoRA Weights
For faster inference, you can merge the LoRA weights into the base model:
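from transformers import Qwen2VLForConditionalGeneration
from peft import PeftModel
import torch
# Load the base model unquantized (bfloat16) so the adapter can be merged cleanly
base_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype=torch.bfloat16
)
# Load and merge
model = PeftModel.from_pretrained(base_model, "ahczhg/qwen3-vl-2b-pagoda-lora")
merged_model = model.merge_and_unload()
# Save merged model (processor is the one loaded in Quick Start)
merged_model.save_pretrained("./merged_model")
processor.save_pretrained("./merged_model")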
Performance
Training Metrics
- Final Training Loss: ~X.XX (update after training)
- Final Validation Loss: ~X.XX (update after training)
- Training Steps: ~113 steps (900 samples / 8 effective batch size)
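The step count above follows from the batch settings. The sketch below is a small calculation under the assumption that a final partial accumulation window still triggers an optimizer step. Some trainers drop it, which would give 112 instead. The function name `optimizer_steps` is hypothetical.

```python
import math


def optimizer_steps(n_train, per_device_batch=1, grad_accum=8, epochs=1, n_devices=1):
    """Return (effective_batch_size, total_optimizer_steps)."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = math.ceil(n_train / effective_batch)
    return effective_batch, steps_per_epoch * epochs


print(optimizer_steps(900))  # (8, 113)
```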
Inference Speed
- With LoRA Adapter: ~Same as base model
- Merged Model: Identical to base model
- Memory Usage: ~6-8 GB VRAM (4-bit quantization)
Example Outputs
Input: [Image of a pagoda]
Prompt: "Describe this image in detail."
Output: [Model-generated description]
Limitations
- Limited Training Data: Only 1000 samples used (demonstration purposes)
- Single Epoch: Model may benefit from additional training epochs
- Domain Specific: Optimized for pagoda-related content
- Text Truncation: Training text limited to 50 characters
- Image Resolution: Training images resized to 280px max
- Quantization: 4-bit quantization may slightly reduce quality vs full precision
Potential Improvements
- More Training Data: Expand to full dataset
- Longer Training: Train for 3-5 epochs
- Higher LoRA Rank: Increase from 8 to 16 or 32
- Larger Images: Train with 512px or higher resolution
- Longer Text: Remove text truncation for better descriptions
- Full Fine-tuning: Fine-tune all parameters (requires more compute)
License
This model inherits the license from the base model:
- License: Apache 2.0
- Base Model: Qwen/Qwen2-VL-2B-Instruct
Acknowledgments
- Base Model: Qwen Team for Qwen2-VL-2B-Instruct
- Dataset: nojiyoon/pagoda-text-and-image-dataset-small
- Infrastructure: Vast.ai for H200 GPU access
- Framework: HuggingFace Transformers, PEFT, bitsandbytes
Citation
@misc{qwen2vl-pagoda-lora,
  author = {Your Name},
  title = {Qwen2-VL-2B Fine-tuned on Pagoda Dataset},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/ahczhg/qwen3-vl-2b-pagoda-lora}}
}
Links
- Base Model: Qwen/Qwen2-VL-2B-Instruct
- Dataset: nojiyoon/pagoda-text-and-image-dataset-small
- PEFT Library: https://github.com/huggingface/peft