D2VLM Models

Here we provide the pre-trained D2VLM models. Their performance on E.T. Bench is shown below.

| Model Name | Referring (Acc) | Grounding (F1) | Dense Captioning (F1) | Dense Captioning (Sim) | Complex (Recall) |
|---|---|---|---|---|---|
| D2VLM | 25.3 | 42.3 | 37.5 | 21.8 | 18.1 |
| D2VLM_mcqa_enhanced | 38.3 | 44.3 | 37.2 | 21.4 | 18.6 |

Some Notes

  1. For the Referring tasks of E.T. Bench (RAR/EVC/RVQ), we adopt a more stringent evaluation protocol than the original E.T. Bench, which usually yields lower metric values (e.g., a drop of more than 10% for some existing methods under our stringent metrics).

  2. To enhance basic instruction-following capability, we incorporate automatically constructed multiple-choice questions into the proposed factorized preference optimization process. Because our factorized preference data synthesis can generate diverse distractor options from different causes of failure, we simply combine these distractors with the original correct answer to form multiple-choice questions, without requiring any additional external data sources. We refer to the resulting model as "D2VLM_mcqa_enhanced".
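The MCQ construction described in note 2 can be sketched as follows. This is a minimal illustration, not the released D2VLM pipeline: the function name, the distractor strings, and the option format are all assumptions for the sake of the example.

```python
import random


def build_mcq(question, correct_answer, distractors, seed=0):
    """Illustrative sketch: combine a correct answer with synthesized
    distractor options (e.g., derived from different failure causes)
    into a multiple-choice question. Hypothetical helper, not the
    actual D2VLM data-synthesis code."""
    rng = random.Random(seed)
    options = [correct_answer] + list(distractors)
    rng.shuffle(options)
    letters = "ABCDEFG"[: len(options)]
    # Record which letter now holds the correct answer.
    answer_letter = letters[options.index(correct_answer)]
    prompt = question + "\n" + "\n".join(
        f"{letter}. {option}" for letter, option in zip(letters, options)
    )
    return prompt, answer_letter
```

For example, distractors might encode failure modes such as a shifted time span or a hallucinated event, so each wrong option corresponds to a distinct cause of failure.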
