Qwen2-VL-2B Fine-tuned on Pagoda Dataset (LoRA)

This model is a fine-tuned version of Qwen/Qwen2-VL-2B-Instruct using LoRA (Low-Rank Adaptation) on the Pagoda text-and-image dataset.

🎯 Model Description

This is a multimodal vision-language model fine-tuned to understand and describe images, with a focus on pagoda-related content. LoRA adapters are applied to both the vision encoder and the language decoder, so only about 1% of the parameters are trained.

Key Features:

  • ✅ Multimodal Fine-tuning: Both vision and language components were trained
  • ✅ Efficient Training: 4-bit quantization with LoRA (only ~1% of parameters trained)
  • ✅ Memory Optimized: Trained with aggressive memory optimizations
  • ✅ Fast Inference: Maintains the base model's inference speed
  • ✅ Production Ready: Includes processor and generation configs

📊 Training Details

Hardware

  • GPU: 1x NVIDIA H200 (140.4 GB VRAM)
    • VRAM Usage: ~6.7 GB / 140.4 GB
    • Memory Bandwidth: 4051.9 GB/s
    • Compute: 53.5 TFLOPS
  • CPU: Intel Xeon Platinum 8568Y+ (96 cores, 12 used)
  • RAM: 387.0 GB (3 GB used)
  • Storage: Dell Enterprise NVMe PM1735a 3.2TB
    • Read Speed: 24985.6 MB/s
  • Network: 1250 Mbps (download/upload ~7.3 Gbps)
  • Platform: Vast.ai Instance #27922151
  • CUDA Version: 12.8

Dataset

  • Source: nojiyoon/pagoda-text-and-image-dataset-small
  • Samples Used: 1000 samples
  • Train/Val Split: 900 / 100 (90% / 10%)
  • Image Processing: Resized to max 280×280 px to reduce memory
  • Text Processing: Truncated to 50 characters max
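
The preprocessing listed above can be sketched in plain Python. This is an illustration under assumptions, not the author's script: the helper names `fit_within`, `truncate_text`, and `split_indices` are hypothetical, and the card does not say whether the resize preserved aspect ratio or whether the split was shuffled.

```python
def fit_within(width, height, max_side=280):
    """Scale (width, height) down so the longer side is at most max_side.

    Aspect ratio is preserved (an assumption); images that are already
    small enough are returned unchanged.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def truncate_text(text, max_chars=50):
    """Keep only the first max_chars characters of a caption."""
    return text[:max_chars]


def split_indices(n_samples=1000, val_fraction=0.1):
    """Return (train, val) index lists for a simple unshuffled split."""
    n_val = round(n_samples * val_fraction)
    n_train = n_samples - n_val
    indices = list(range(n_samples))
    return indices[:n_train], indices[n_train:]


train_idx, val_idx = split_indices()
print(len(train_idx), len(val_idx))
print(fit_within(1120, 560))
```

With the card's numbers, the split yields 900 training and 100 validation indices.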

Model Architecture

  • Base Model: Qwen2-VL-2B-Instruct
  • Total Parameters: ~2 Billion
  • LoRA Rank: 8
  • LoRA Alpha: 16
  • LoRA Dropout: 0.05
  • Target Modules:
    • Vision: qkv (attention projections in vision transformer)
    • Language: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Trainable Parameters: 15-20 million (0.75-1% of total)
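
As a rough sketch, the settings above map onto keyword arguments of the kind `peft.LoraConfig` accepts. Whether the author used a single config for both the vision and language modules is an assumption, and `lora_param_count` is a hypothetical helper added only to show how rank drives the trainable-parameter count.

```python
# Keyword arguments mirroring the values listed above. They are meant to be
# passed as peft.LoraConfig(**lora_kwargs); peft itself is not imported here.
lora_kwargs = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "target_modules": [
        "qkv",                                   # vision transformer attention
        "q_proj", "k_proj", "v_proj", "o_proj",  # language attention
        "gate_proj", "up_proj", "down_proj",     # language MLP
    ],
}

# LoRA scales the low-rank update by alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]


def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_in -> d_out linear layer.

    The adapter consists of A with shape (r, d_in) and B with shape (d_out, r).
    """
    return r * (d_in + d_out)


# Hypothetical 1536 -> 1536 projection, chosen only for illustration.
example = lora_param_count(1536, 1536, lora_kwargs["r"])
print(scaling, example)
```

Because the count grows linearly with `r`, doubling the rank doubles the adapter size for every targeted layer.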

Training Hyperparameters

  • Quantization: 4-bit NF4 with double quantization
  • Compute Dtype: bfloat16
  • Epochs: 1
  • Batch Size: 1 per device
  • Gradient Accumulation: 8 steps
  • Effective Batch Size: 8
  • Learning Rate: 2e-4
  • LR Scheduler: Cosine with warmup
  • Warmup Steps: 50
  • Optimizer: PagedAdamW 8-bit
  • Gradient Clipping: 1.0
  • Gradient Checkpointing: Enabled (non-reentrant)
  • Mixed Precision: bfloat16

Training Configuration

  • Image Resolution: 28×28 to 280×280 pixels (min/max)
  • Sequence Length: Dynamic (no max-length truncation, to avoid a mismatch between image placeholder tokens and image features)
  • Data Workers: 0 (memory optimization)
  • Pin Memory: Disabled
  • Training Time: ~15-20 minutes
  • Evaluation Strategy: Every 100 steps
  • Save Strategy: Every 100 steps (keep only best)
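
As a rough guide to what the resolution bounds mean for sequence length: Qwen2-VL uses 14-pixel patches and merges each 2×2 group of patches, so one visual token covers roughly a 28×28-pixel area. The sketch below estimates the visual token count from that ratio. The helper `approx_visual_tokens` is hypothetical and ignores the processor's exact rounding rules and the special tokens that wrap each image.

```python
PIXELS_PER_TOKEN_SIDE = 28  # 14 px patch x 2 (2x2 patch merge)


def approx_visual_tokens(width, height):
    """Estimate visual tokens for an image of the given pixel size."""
    tokens_w = max(1, round(width / PIXELS_PER_TOKEN_SIDE))
    tokens_h = max(1, round(height / PIXELS_PER_TOKEN_SIDE))
    return tokens_w * tokens_h


for size in (28, 280, 512):
    print(size, approx_visual_tokens(size, size))
```

By this estimate, the 280×280 cap keeps each image to about 100 visual tokens, while a 512×512 image would need roughly three times as many.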

Memory Optimizations

  • 4-bit quantization reduces weight memory by roughly 75% relative to 16-bit weights
  • Small image resolution (280px max) reduces vision tokens
  • Aggressive text truncation (50 chars) reduces sequence length
  • Gradient checkpointing reduces activation memory
  • No data loader workers to save RAM
  • Batch size of 1 with gradient accumulation
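
The 75% figure follows from simple arithmetic on weight storage. The sketch below uses the rounded "~2 Billion" parameter count from this card and ignores quantization constants, any layers left unquantized, activations, and optimizer state, so it is a back-of-the-envelope estimate rather than a measurement. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 2e9  # rounded figure from the card; the exact count differs

bf16_gb = weight_memory_gb(N_PARAMS, 16)
nf4_gb = weight_memory_gb(N_PARAMS, 4)
reduction = 1 - nf4_gb / bf16_gb

print(bf16_gb, nf4_gb, reduction)
```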

🚀 Usage

Installation

pip install transformers accelerate peft pillow torch
# For 4-bit quantization (optional but recommended):
pip install bitsandbytes
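
A quick way to confirm the environment is ready before running the examples is to check that each package can be found. This is an optional sketch using only the standard library; the helper name `missing_packages` is hypothetical.

```python
import importlib.util

# Maps pip package names to their import names (pillow is imported as PIL).
REQUIRED = {
    "transformers": "transformers",
    "accelerate": "accelerate",
    "peft": "peft",
    "pillow": "PIL",
    "torch": "torch",
}
OPTIONAL = {"bitsandbytes": "bitsandbytes"}


def missing_packages(packages):
    """Return the pip names whose import module cannot be located."""
    return [
        pip_name
        for pip_name, module_name in packages.items()
        if importlib.util.find_spec(module_name) is None
    ]


print("missing required:", missing_packages(REQUIRED))
print("missing optional:", missing_packages(OPTIONAL))
```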

Quick Start

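from transformers import (
    AutoProcessor,
    BitsAndBytesConfig,
    Qwen2VLForConditionalGeneration,
)
from peft import PeftModel
from PIL import Image
import torch

# 4-bit NF4 quantization config (requires bitsandbytes).
# Omit quantization_config below to load the base model in plain bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load base model with 4-bit quantization
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
)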

# Load LoRA adapter
model = PeftModel.from_pretrained(model, "ahczhg/qwen3-vl-2b-pagoda-lora")
model.eval()

# Load processor
processor = AutoProcessor.from_pretrained(
    "ahczhg/qwen3-vl-2b-pagoda-lora",
    trust_remote_code=True
)

# Load the image and convert to RGB (handles grayscale, palette, and RGBA files)
image = Image.open("your_image.jpg").convert("RGB")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

# Generate response
text = processor.apply_chat_template(
    conversation,
    tokenize=False,
    add_generation_prompt=True
)
inputs = processor(
    text=[text],
    images=[image],  # flat list: one image for the single prompt
    return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )

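# Drop the prompt tokens so that only the newly generated text is decoded
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]
print(response)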

Advanced: Merge LoRA Weights

For faster inference, you can merge the LoRA weights into the base model:

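import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from peft import PeftModel

# Load the base model unquantized: merging into 4-bit weights can introduce rounding errors
base_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained("ahczhg/qwen3-vl-2b-pagoda-lora")

# Load and merge
model = PeftModel.from_pretrained(base_model, "ahczhg/qwen3-vl-2b-pagoda-lora")
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./merged_model")
processor.save_pretrained("./merged_model")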
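
Merging works because a LoRA layer computes W·x + (alpha / r)·B·(A·x), which equals (W + (alpha / r)·B·A)·x. The toy example below checks that identity with tiny hand-written matrices in plain Python. The values are made up for illustration, the toy adapter has rank 1 for brevity, and the scale of 2.0 reuses this card's alpha / r = 16 / 8.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(a, b, scale):
    """Return a + scale * b, element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight, shape (out=2, in=2)
B = [[1.0], [0.5]]            # LoRA B, shape (out=2, r=1)
A = [[0.5, -1.0]]             # LoRA A, shape (r=1, in=2)
scale = 16 / 8                # alpha / r from this card
x = [[1.0], [2.0]]            # input column vector

# Adapter path: base output plus the scaled low-rank update.
adapter_out = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Merged path: fold the update into the weight once, then do a single matmul.
W_merged = add_scaled(W, matmul(B, A), scale)
merged_out = matmul(W_merged, x)

print(adapter_out, merged_out)
```

After merging, inference needs only the single matrix multiply, which is why the merged model has no adapter overhead.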

📈 Performance

Training Metrics

  • Final Training Loss: ~X.XX (update after training)
  • Final Validation Loss: ~X.XX (update after training)
  • Training Steps: ~113 steps (900 samples / 8 effective batch size)
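
The step count can be reproduced with a short calculation. The sketch assumes the final partial accumulation window still counts as an optimizer step (some trainers round down instead, which would give 112), and it uses the "every 100 steps" evaluation interval from the training configuration.

```python
import math

train_samples = 900
effective_batch_size = 8
eval_every = 100

# Round up so the last, partially filled accumulation window is counted.
total_steps = math.ceil(train_samples / effective_batch_size)

# Steps at which evaluation and checkpointing would trigger.
eval_steps = list(range(eval_every, total_steps + 1, eval_every))

print(total_steps, eval_steps)
```

Under these assumptions, a one-epoch run triggers only a single mid-training evaluation and checkpoint.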

Inference Speed

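  • With LoRA Adapter: Approximately the same as the base model (the unmerged adapter adds a small per-layer overhead)
  • Merged Model: Same speed as the base model, since merging removes the separate adapter layers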
  • Memory Usage: ~6-8 GB VRAM (4-bit quantization)

🎨 Example Outputs

Input: [Image of a pagoda]
Prompt: "Describe this image in detail."

Output: [Model-generated description]

⚠️ Limitations

  • Limited Training Data: Only 1000 samples used (demonstration purposes)
  • Single Epoch: Model may benefit from additional training epochs
  • Domain Specific: Optimized for pagoda-related content
  • Text Truncation: Training text limited to 50 characters
  • Image Resolution: Training images resized to 280px max
  • Quantization: 4-bit quantization may slightly reduce quality vs full precision

🔄 Potential Improvements

  1. More Training Data: Expand to full dataset
  2. Longer Training: Train for 3-5 epochs
  3. Higher LoRA Rank: Increase from 8 to 16 or 32
  4. Larger Images: Train with 512px or higher resolution
  5. Longer Text: Remove text truncation for better descriptions
  6. Full Fine-tuning: Fine-tune all parameters (requires more compute)

πŸ“ License

This model inherits the license from the base model:

πŸ™ Acknowledgments

πŸ“š Citation

@misc{qwen2vl-pagoda-lora,
  author = {Your Name},
  title = {Qwen2-VL-2B Fine-tuned on Pagoda Dataset},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/ahczhg/qwen3-vl-2b-pagoda-lora}}
}

🔗 Links
