Qwen2-VL-2B Fine-tuned on Pagoda Dataset (LoRA)

This model is a fine-tuned version of Qwen/Qwen2-VL-2B-Instruct using LoRA (Low-Rank Adaptation) on the Pagoda text-and-image dataset.

🎯 Model Description

This is a multimodal vision-language model fine-tuned to describe images, with a focus on pagoda-related content. LoRA adapters are applied to both the vision encoder and the language decoder, so only a small fraction of the weights is trained while the quantized base model stays frozen.

Key Features:

✅ Multimodal Fine-tuning: Both vision and language components were trained
✅ Efficient Training: 4-bit quantization with LoRA (only ~1% of parameters trained)
✅ Memory Optimized: Trained with aggressive memory optimizations
✅ Fast Inference: Close to the base model's speed with the adapter attached, and the same as the base model once merged
✅ Ready to Use: Includes processor and generation configs

📊 Training Details

Hardware

GPU: 1x NVIDIA H200 (140.4 GB VRAM)
- VRAM Usage: ~6.7 GB / 140.4 GB
- Memory Bandwidth: 4051.9 GB/s
- Compute: 53.5 TFLOPS
CPU: Intel Xeon Platinum 8568Y+ (96 cores, 12 used)
RAM: 387.0 GB (3 GB used)
Storage: Dell Enterprise NVMe PM1735a 3.2TB
- Read Speed: 24985.6 MB/s
Network: 1250 Mbps (download/upload ~7.3 Gbps)
Platform: Vast.ai Instance #27922151
CUDA Version: 12.8

Dataset

Source: nojiyoon/pagoda-text-and-image-dataset-small
Samples Used: 1000 samples
Train/Val Split: 900 / 100 (90% / 10%)
Image Processing: Resized to max 280×280px to reduce memory
Text Processing: Truncated to 50 characters max
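
The preprocessing described above can be sketched in plain Python. This is a minimal illustration, not the author's actual pipeline: the helper names `fit_within` and `truncate_text` are hypothetical, and the sketch assumes the resize preserves aspect ratio and never upscales, which this card does not state.

```python
def fit_within(width: int, height: int, max_side: int = 280) -> tuple[int, int]:
    """Return a size that fits inside max_side x max_side.

    Assumption: aspect ratio is preserved and small images are not upscaled.
    """
    scale = min(max_side / width, max_side / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


def truncate_text(text: str, max_chars: int = 50) -> str:
    """Keep at most max_chars characters of the caption."""
    return text[:max_chars]


print(fit_within(1120, 560))         # (280, 140)
print(len(truncate_text("x" * 80)))  # 50
```

With Pillow, the computed size could be applied as `image.resize(fit_within(*image.size))`.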

Model Architecture

Base Model: Qwen2-VL-2B-Instruct
Total Parameters: ~2 Billion
LoRA Rank: 8
LoRA Alpha: 16
LoRA Dropout: 0.05
Target Modules:
- Vision: qkv (attention projections in vision transformer)
- Language: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Trainable Parameters: ~15-20 million (~0.75-1% of total)
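
One way the settings above could be expressed with PEFT is sketched below. The training script is not part of this card, so treat this as an assumption about the setup rather than the configuration that was actually run. The `bias="none"` choice in particular is a guess.

```python
from peft import LoraConfig

# Sketch only: mirrors the values listed above, not the author's actual script.
lora_config = LoraConfig(
    r=8,                 # LoRA rank
    lora_alpha=16,       # scaling factor (alpha / r = 2)
    lora_dropout=0.05,
    bias="none",         # assumption: bias terms stay frozen
    target_modules=[
        "qkv",                                   # vision attention projection
        "q_proj", "k_proj", "v_proj", "o_proj",  # language attention
        "gate_proj", "up_proj", "down_proj",     # language MLP
    ],
)
```

Wrapping a loaded model with `get_peft_model(model, lora_config)` and calling `print_trainable_parameters()` on the result is the usual way to check the trainable-parameter count.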

Training Hyperparameters

Quantization: 4-bit NF4 with double quantization
Compute Dtype: bfloat16
Epochs: 1
Batch Size: 1 per device
Gradient Accumulation: 8 steps
Effective Batch Size: 8
Learning Rate: 2e-4
LR Scheduler: Cosine with warmup
Warmup Steps: 50
Optimizer: PagedAdamW 8-bit
Gradient Clipping: 1.0
Gradient Checkpointing: Enabled (non-reentrant)
Mixed Precision: bfloat16
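
To make the schedule concrete, the sketch below computes the learning rate at a given step. It assumes the common shape of linear warmup followed by a half-cosine decay to zero (the card only says "Cosine with warmup"), and it takes the total of 113 steps from the Training Metrics section. The function name `cosine_lr` is hypothetical.

```python
import math


def cosine_lr(step: int, base_lr: float = 2e-4, warmup_steps: int = 50,
              total_steps: int = 113) -> float:
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 25, 50, 113):
    print(step, cosine_lr(step))
```

Under these assumptions, warmup covers 50 of about 113 steps, so the peak rate of 2e-4 is reached a little under halfway through the single epoch.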

Training Configuration

Image Resolution: 28×28 to 280×280 pixels (min/max)
Sequence Length: Dynamic (no max-length truncation, so image placeholder tokens are never cut off and always match the image features)
Data Workers: 0 (memory optimization)
Pin Memory: Disabled
Training Time: ~15-20 minutes
Evaluation Strategy: Every 100 steps
Save Strategy: Every 100 steps (keep only best)
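
The 28×28 to 280×280 pixel bounds translate into a small number of visual tokens per image. The sketch below estimates that count, assuming that Qwen2-VL produces one visual token per 28×28 pixel region (14-pixel patches merged 2×2). It ignores the special tokens that wrap an image and any rounding the processor applies during resizing. The function name `visual_tokens` is hypothetical.

```python
def visual_tokens(width: int, height: int, pixels_per_token_side: int = 28) -> int:
    """Approximate visual token count for an image whose sides are multiples of 28."""
    return (width // pixels_per_token_side) * (height // pixels_per_token_side)


print(visual_tokens(28, 28))    # 1
print(visual_tokens(280, 280))  # 100
```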

Memory Optimizations

4-bit quantization reduces weight memory by about 75% relative to 16-bit weights
Small image resolution (280px max) reduces vision tokens
Aggressive text truncation (50 chars) reduces sequence length
Gradient checkpointing reduces activation memory
No data loader workers to save RAM
Batch size of 1 with gradient accumulation
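
The 75% figure can be checked with simple arithmetic, shown below as a rough estimate. It counts weights only, using the "~2 Billion" parameter figure from Model Architecture. It ignores quantization constants, layers that stay unquantized, activations, and optimizer state, so the result is not comparable to the ~6.7 GB VRAM usage reported under Hardware.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params = 2e9  # "~2 Billion" from the Model Architecture section
bf16_gb = weight_memory_gb(params, 16)
nf4_gb = weight_memory_gb(params, 4)
reduction = 1 - nf4_gb / bf16_gb

print(bf16_gb, nf4_gb, reduction)  # 4.0 1.0 0.75
```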

🚀 Usage

Installation

pip install transformers accelerate peft pillow torch
# For 4-bit quantization (optional but recommended):
pip install bitsandbytes

Quick Start

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
from PIL import Image
import torch

# Load base model in bfloat16
# (for 4-bit loading, pass a BitsAndBytesConfig via quantization_config)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
)

# Load LoRA adapter
model = PeftModel.from_pretrained(model, "ahczhg/qwen3-vl-2b-pagoda-lora")
model.eval()

# Load processor
processor = AutoProcessor.from_pretrained(
    "ahczhg/qwen3-vl-2b-pagoda-lora",
    trust_remote_code=True
)

# Load and process image
image = Image.open("your_image.jpg")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

# Generate response
text = processor.apply_chat_template(
    conversation,
    tokenize=False,
    add_generation_prompt=True
)
inputs = processor(
    text=[text],
    images=[[image]],
    return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )

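# generate() returns prompt + completion; decode only the new tokens
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]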
print(response)
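
The Installation section recommends bitsandbytes for 4-bit loading. A minimal sketch of that variant follows, reusing the NF4 settings listed under Training Hyperparameters. It assumes a CUDA GPU with the `bitsandbytes` package installed and replaces only the base-model loading step. The adapter, processor, and generation steps stay the same.

```python
import torch
from transformers import BitsAndBytesConfig, Qwen2VLForConditionalGeneration

# Sketch: NF4 settings copied from the Training Hyperparameters section.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Requires a CUDA GPU and the bitsandbytes package.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```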

Advanced: Merge LoRA Weights

For faster inference, you can merge the LoRA weights into the base model:

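import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from peft import PeftModel

# Load the base model unquantized (bfloat16) so the merged weights can be saved
base_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained("ahczhg/qwen3-vl-2b-pagoda-lora")

# Load the adapter and fold its weights into the base model
model = PeftModel.from_pretrained(base_model, "ahczhg/qwen3-vl-2b-pagoda-lora")
merged_model = model.merge_and_unload()

# Save merged model and processor
merged_model.save_pretrained("./merged_model")
processor.save_pretrained("./merged_model")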
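
As an illustration of what merging does, the toy example below applies the standard LoRA update W + (alpha / r) · B · A to a 2×2 weight and checks that the merged weight gives the same output as the separate base and adapter paths. The matrices and helper functions are made up for the example and use plain Python lists. They are not taken from this model.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by s."""
    return [[x * s for x in row] for row in a]


r, alpha = 1, 2                 # toy values; this model uses r=8, alpha=16
W = [[1.0, 0.0], [0.0, 1.0]]    # frozen base weight (2 x 2)
B = [[0.5], [0.0]]              # LoRA "up" matrix (2 x r)
A = [[0.0, 1.0]]                # LoRA "down" matrix (r x 2)
x = [[3.0], [4.0]]              # input column vector

delta = scale(matmul(B, A), alpha / r)
W_merged = add(W, delta)

adapter_out = add(matmul(W, x), matmul(delta, x))  # base path + LoRA path
merged_out = matmul(W_merged, x)                   # single matmul after merging
print(adapter_out, merged_out)
```

Both outputs match, which is why a merged model needs no extra adapter computation at inference time.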

📈 Performance

Training Metrics

Final Training Loss: ~X.XX (update after training)
Final Validation Loss: ~X.XX (update after training)
Training Steps: ~113 steps (900 samples / 8 effective batch size)
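
The step count follows from the dataset split and batch settings. The short calculation below assumes a single GPU and that the final, partial accumulation window still counts as an optimizer step (900 / 8 = 112.5).

```python
import math

train_samples = 900      # training split size
per_device_batch = 1     # batch size per device
grad_accum_steps = 8     # gradient accumulation steps
epochs = 1

effective_batch = per_device_batch * grad_accum_steps
steps = math.ceil(train_samples / effective_batch) * epochs
print(effective_batch, steps)  # 8 113
```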

Inference Speed

With LoRA Adapter: ~Same as base model
Merged Model: Identical to base model
Memory Usage: ~6-8 GB VRAM (4-bit quantization)

🎨 Example Outputs

Input: [Image of a pagoda]
Prompt: "Describe this image in detail."

Output: [Model-generated description]

⚠️ Limitations

Limited Training Data: Only 1000 samples used (demonstration purposes)
Single Epoch: Model may benefit from additional training epochs
Domain Specific: Optimized for pagoda-related content
Text Truncation: Training text limited to 50 characters
Image Resolution: Training images resized to 280px max
Quantization: 4-bit quantization may slightly reduce quality vs full precision

🔄 Potential Improvements

More Training Data: Expand to full dataset
Longer Training: Train for 3-5 epochs
Higher LoRA Rank: Increase from 8 to 16 or 32
Larger Images: Train with 512px or higher resolution
Longer Text: Remove text truncation for better descriptions
Full Fine-tuning: Fine-tune all parameters (requires more compute)

📝 License

This model inherits the license from the base model:

License: Apache 2.0
Base Model: Qwen/Qwen2-VL-2B-Instruct

🙏 Acknowledgments

Base Model: Qwen Team for Qwen2-VL-2B-Instruct
Dataset: nojiyoon/pagoda-text-and-image-dataset-small
Infrastructure: Vast.ai for H200 GPU access
Framework: HuggingFace Transformers, PEFT, bitsandbytes

📚 Citation

@misc{qwen2vl-pagoda-lora,
  author = {Your Name},
  title = {Qwen2-VL-2B Fine-tuned on Pagoda Dataset},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/ahczhg/qwen3-vl-2b-pagoda-lora}}
}

🔗 Links

Base Model: Qwen/Qwen2-VL-2B-Instruct
Dataset: nojiyoon/pagoda-text-and-image-dataset-small
PEFT Library: https://github.com/huggingface/peft
