Model Card: clip-imageclef
Model Details
OpenAI CLIP model fine-tuned on image-caption pairs from the Caption Prediction dataset provided for the ImageCLEF 2017 competition. The model was evaluated before and after fine-tuning; MRR@10 was 0.57 before and 0.88 after.
Model Date
September 6, 2021
Model Type
The base model is the OpenAI CLIP model. It uses a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.
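To make the contrastive objective concrete, here is a minimal pure-Python sketch of a symmetric CLIP-style loss over a batch of paired embeddings. It is an illustration, not the training code used for this model: the function names and the fixed `logit_scale` value are assumptions (CLIP learns its temperature during training), and real training would use a tensor library.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_contrastive_loss(image_embs, text_embs, logit_scale=100.0):
    """Symmetric contrastive loss; image_embs[i] is paired with text_embs[i].

    logit_scale is a hypothetical fixed temperature used only for this sketch.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Scaled cosine similarity between every image and every text in the batch.
    logits_per_image = [
        [logit_scale * sum(a * b for a, b in zip(img, txt)) for txt in txts]
        for img in imgs
    ]
    # Transpose to get the text-to-image view of the same similarities.
    logits_per_text = [list(col) for col in zip(*logits_per_image)]
    return 0.5 * (
        cross_entropy_diagonal(logits_per_image)
        + cross_entropy_diagonal(logits_per_text)
    )
```

The loss is small when each image is most similar to its own caption and large when the pairing is scrambled, which is the property fine-tuning on image-caption pairs exploits.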
Fine-tuning
The fine-tuning can be reproduced using code from the GitHub repository elsevierlabs-os/clip-image-search.
Usage
from transformers import CLIPModel, CLIPProcessor

# Fine-tuned weights; the processor comes from the base CLIP checkpoint.
model = CLIPModel.from_pretrained("sujitpal/clip-imageclef")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# captions: list of strings; images: list of PIL images
inputs = processor(text=captions, images=images,
                   return_tensors="pt", padding=True)
output = model(**inputs)
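One possible next step is to rank the captions for each image by similarity score. The sketch below works on a plain nested list so it can run without the model; with the snippet above, `output.logits_per_image.tolist()` should provide such a list, with one row per image and one column per caption. The helper name `top_k_captions` is hypothetical.

```python
def top_k_captions(logits_per_image, k=3):
    """Return, for each image, the indices of its k highest-scoring captions."""
    ranked = []
    for row in logits_per_image:
        # Sort caption indices by descending similarity score.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        ranked.append(order[:k])
    return ranked


# Toy scores for 2 images and 3 captions (made-up numbers, not model output).
scores = [[0.1, 0.9, 0.5],
          [2.0, 1.0, 3.0]]
best = top_k_captions(scores, k=2)
```

The indices in `best` refer to positions in the `captions` list passed to the processor.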
Performance
| Model-name | k=1 | k=3 | k=5 | k=10 | k=20 |
| --- | --- | --- | --- | --- | --- |
| zero-shot CLIP (baseline) | 0.426 | 0.534 | 0.558 | 0.573 | 0.578 |
| clip-imageclef (this model) | 0.802 | 0.872 | 0.877 | 0.879 | 0.880 |
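The k=10 values match the MRR@10 figures quoted under Model Details, so the columns appear to report MRR@k. As a reference for how such a metric is typically computed, here is a minimal sketch. It is not the evaluation script used for this model, and the function name and input layout are assumptions: each query has one ranked list of candidate IDs and exactly one correct ID.

```python
def mrr_at_k(ranked_lists, correct_items, k=10):
    """Mean reciprocal rank, counting only hits within the top k results."""
    total = 0.0
    for ranked, correct in zip(ranked_lists, correct_items):
        for position, item in enumerate(ranked[:k], start=1):
            if item == correct:
                total += 1.0 / position  # reciprocal of the 1-based rank
                break
        # A query whose correct item is outside the top k contributes 0.
    return total / len(correct_items)


# Toy example: three queries whose correct items sit at ranks 1, 2, and 3.
ranked = [[0, 1, 2], [2, 1, 0], [1, 0, 2]]
correct = [0, 1, 2]
score_at_3 = mrr_at_k(ranked, correct, k=3)
score_at_1 = mrr_at_k(ranked, correct, k=1)
```

Under this definition, MRR@1 equals top-1 accuracy, and the score can only stay the same or grow as k increases, which is consistent with the rows in the table.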