Commit e78573e · Parent: 36746ff

Upload 7 files

Files changed:
- .gitattributes +0 -1
- README.md +115 -0
- added_tokens.json +1 -0
- config.json +37 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +1 -0
- vocab.txt +0 -0
.gitattributes CHANGED

```diff
@@ -25,7 +25,6 @@
 *.safetensors filter=lfs diff=lfs merge=lfs -text
 saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.tar.* filter=lfs diff=lfs merge=lfs -text
-*.tar filter=lfs diff=lfs merge=lfs -text
 *.tflite filter=lfs diff=lfs merge=lfs -text
 *.tgz filter=lfs diff=lfs merge=lfs -text
 *.wasm filter=lfs diff=lfs merge=lfs -text
```
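The change above stops routing plain `*.tar` files through Git LFS while keeping the other patterns. A minimal sketch of the effect, approximating git's pattern matching with Python's `fnmatch` (which treats `**` differently from git, so this is illustrative only):

```python
from fnmatch import fnmatch

# LFS patterns still present in .gitattributes after this commit
lfs_patterns = ["*.safetensors", "saved_model/**/*", "*.tar.*",
                "*.tflite", "*.tgz", "*.wasm"]

for path in ["model.safetensors", "archive.tar", "archive.tar.gz"]:
    tracked = any(fnmatch(path, p) for p in lfs_patterns)
    # archive.tar no longer matches any pattern, so it is stored as a regular blob
    print(f"{path}: {'LFS' if tracked else 'regular git'}")
```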
README.md CHANGED

@@ -1,3 +1,118 @@
---
license: apache-2.0
datasets:
- michelecafagna26/hl
language:
- en
metrics:
- sacrebleu
- rouge
- meteor
- spice
- cider
library_name: pytorch
tags:
- pytorch
- image-to-text
---

# Model Card: VinVL for Captioning 🖼️

[Microsoft's VinVL](https://github.com/microsoft/Oscar) base fine-tuned on the [HL dataset](https://arxiv.org/abs/2302.12189?context=cs.CL) for the **rationale description generation** downstream task.

# Model fine-tuning 🏋️

The model has been fine-tuned for 10 epochs on the scene captions of the [HL dataset](https://arxiv.org/abs/2302.12189?context=cs.CL), available on the 🤗 Hub as [michelecafagna26/hl](https://huggingface.co/datasets/michelecafagna26/hl).
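For reference, the fine-tuning data can be pulled straight from the Hub; a minimal sketch using the 🤗 `datasets` library:

```python
from datasets import load_dataset

# the HL dataset this model was fine-tuned on
hl = load_dataset("michelecafagna26/hl")
print(hl)  # inspect the available splits and fields
```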
# Test set metrics 📈

Obtained with beam size 5 and max length 20:

| Bleu-1 | Bleu-2 | Bleu-3 | Bleu-4 | METEOR | ROUGE-L | CIDEr | SPICE |
|--------|--------|--------|--------|--------|---------|-------|-------|
| 0.55   | 0.38   | 0.23   | 0.15   | 0.17   | 0.44    | 0.44  | 0.10  |
# Usage and Installation

More info about how to install and use this model can be found here: [michelecafagna26/VinVL](https://github.com/michelecafagna26/VinVL).
# Feature extraction ⛏️

This model relies on a separate VisualBackbone to extract the visual features. More info about:
- the model: [michelecafagna26/vinvl_vg_x152c4](https://huggingface.co/michelecafagna26/vinvl_vg_x152c4)
- the usage: [michelecafagna26/vinvl-visualbackbone](https://github.com/michelecafagna26/vinvl-visualbackbone)
# Quick start 🚀

```python
import torch

from transformers.pytorch_transformers import BertConfig, BertTokenizer
from oscar.modeling.modeling_bert import BertForImageCaptioning
from oscar.wrappers import OscarTensorizer

ckpt = "path/to/the/checkpoint"
device = "cuda" if torch.cuda.is_available() else "cpu"

# original code
config = BertConfig.from_pretrained(ckpt)
tokenizer = BertTokenizer.from_pretrained(ckpt)
model = BertForImageCaptioning.from_pretrained(ckpt, config=config).to(device)

# This takes care of the preprocessing
tensorizer = OscarTensorizer(tokenizer=tokenizer, device=device)

# feat_obj is a numpy array of region features produced by the VisualBackbone;
# after unsqueeze(0) the tensor has shape (1, num_boxes, feat_size),
# where feat_size is 2054 by default in VinVL
visual_features = torch.from_numpy(feat_obj).to(device).unsqueeze(0)

# labels (object tags) are usually produced by the feature extractor
labels = [['boat', 'boat', 'boat', 'bottom', 'bush', 'coat', 'deck', 'deck', 'deck', 'dock', 'hair', 'jacket']]

inputs = tensorizer.encode(visual_features, labels=labels)
outputs = model(**inputs)

pred = tensorizer.decode(outputs)

# the output looks like this:
# pred = {0: [{'caption': 'he is on leisure', 'conf': 0.7070220112800598}]}
```
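`feat_obj` above comes from the VisualBackbone (see "Feature extraction"). To smoke-test the pipeline before wiring that up, a random placeholder with the right shape can stand in; the values below are arbitrary and carry no real image content:

```python
import numpy as np

# hypothetical stand-in for VisualBackbone output: num_boxes regions,
# each a 2054-d vector (2048-d region feature + 6 box-geometry values in VinVL)
num_boxes, feat_size = 10, 2054
feat_obj = np.random.rand(num_boxes, feat_size).astype(np.float32)
```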
# Citations 🧾

HL Dataset paper:

```BibTeX
@inproceedings{cafagna2023hl,
  title={{HL} {D}ataset: {V}isually-grounded {D}escription of {S}cenes, {A}ctions and {R}ationales},
  author={Cafagna, Michele and van Deemter, Kees and Gatt, Albert},
  booktitle={Proceedings of the 16th International Natural Language Generation Conference (INLG'23)},
  address={Prague, Czech Republic},
  year={2023}
}
```

Please consider citing the original project and the VinVL paper:

```BibTeX
@misc{han2021image,
  title={Image Scene Graph Generation (SGG) Benchmark},
  author={Xiaotian Han and Jianwei Yang and Houdong Hu and Lei Zhang and Jianfeng Gao and Pengchuan Zhang},
  year={2021},
  eprint={2107.12604},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

@inproceedings{zhang2021vinvl,
  title={Vinvl: Revisiting visual representations in vision-language models},
  author={Zhang, Pengchuan and Li, Xiujun and Hu, Xiaowei and Yang, Jianwei and Zhang, Lei and Wang, Lijuan and Choi, Yejin and Gao, Jianfeng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={5579--5588},
  year={2021}
}
```
added_tokens.json ADDED

```json
{}
```
config.json ADDED

```json
{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "drop_worst_after": 20000,
  "drop_worst_ratio": 0.2,
  "finetuning_task": "image_captioning",
  "freeze_embedding": true,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "img_feature_dim": 2054,
  "img_feature_type": "frcnn",
  "img_layer_norm_eps": 1e-12,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_bert": true,
  "label_smoothing": 0.1,
  "language_model_type": "MLM",
  "layer_norm_eps": 1e-12,
  "loss_type": "sfmx",
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_contrast_classes": 3,
  "num_hidden_layers": 12,
  "num_labels": 2,
  "output_attentions": false,
  "output_hidden_states": false,
  "pad_token_id": 0,
  "tie_weights": true,
  "torchscript": false,
  "type_vocab_size": 2,
  "use_img_layernorm": 0,
  "vocab_size": 30522
}
```
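A few of these fields tie directly back to the quick-start snippet; a minimal sketch (the local file path is an assumption) for sanity-checking the feature dimensionality before encoding:

```python
import json

with open("config.json") as f:
    cfg = json.load(f)

# region features fed to the model must match this width (2054 in VinVL)
assert cfg["img_feature_dim"] == 2054

# the captioning head sits on a BERT-base-sized encoder
print(cfg["hidden_size"], cfg["num_hidden_layers"])  # 768 12
```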
pytorch_model.bin ADDED

```
version https://git-lfs.github.com/spec/v1
oid sha256:e5d647a6608c0058687c3737f8aa98f2232e0374b3375076d4e7ca84103fa992
size 446817260
```
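The three lines above are a Git LFS pointer, not the weights themselves: `oid` is the SHA-256 of the real ~446 MB file. A minimal sketch (assuming the resolved file sits in the working directory) to verify a download against the pointer:

```python
import hashlib
import os

PATH = "pytorch_model.bin"  # assumed local path to the resolved LFS file
EXPECTED_OID = "e5d647a6608c0058687c3737f8aa98f2232e0374b3375076d4e7ca84103fa992"
EXPECTED_SIZE = 446817260

h = hashlib.sha256()
with open(PATH, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MiB chunks
        h.update(chunk)

print(os.path.getsize(PATH) == EXPECTED_SIZE and h.hexdigest() == EXPECTED_OID)
```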
special_tokens_map.json ADDED

```json
{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
```
vocab.txt ADDED

The diff for this file is too large to render.