dpo_40k_abla_all_eight

This model is a LoRA adapter fine-tuned with DPO from /p/scratch/taco-vlm/xiao4/models/Qwen2.5-VL-7B-Instruct (a local copy of Qwen2.5-VL-7B-Instruct) on the dpo_ablation_all_eight dataset. It achieves the following results on the evaluation set:

  • Loss: 0.5260
  • Rewards/chosen: -0.5688
  • Rewards/rejected: -1.2177
  • Rewards/accuracies: 0.7250
  • Rewards/margins: 0.6490
  • Logps/chosen: -37.6179
  • Logps/rejected: -48.8116
  • Logits/chosen: 0.2010
  • Logits/rejected: 0.1913
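
For reference, these reward metrics follow the standard DPO formulation (a sketch assuming TRL-style metric definitions; the β used in training is not stated in this card):

```latex
% Standard DPO objective (Rafailov et al., 2023). The implicit reward of a
% response y given prompt x is r(x, y) = beta * log( pi_theta(y|x) / pi_ref(y|x) ).
\mathcal{L}_{\mathrm{DPO}}
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Under these definitions, Rewards/chosen and Rewards/rejected are the eval-set means of the implicit reward for the chosen and rejected responses, Rewards/accuracies is the fraction of pairs where the chosen reward exceeds the rejected one, and Rewards/margins is their difference: -0.5688 - (-1.2177) = 0.6489 ≈ 0.6490 (up to rounding).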

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-06
  • train_batch_size: 2
  • eval_batch_size: 1
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 4
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 64
  • total_eval_batch_size: 4
  • optimizer: AdamW (adamw_torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1.0
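
The card does not name the training framework that produced these settings; as a reference, here is a minimal sketch of how they would map onto TRL's `DPOConfig` (the `output_dir` value is an assumption, not taken from this card):

```python
from trl import DPOConfig

# Hypothetical mapping of the hyperparameters above onto TRL's DPOConfig.
# Effective train batch size: 2 per device * 4 GPUs * 8 accumulation steps = 64.
config = DPOConfig(
    output_dir="dpo_40k_abla_all_eight",  # assumption: output path not stated in the card
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1.0,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```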

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/chosen | Logps/rejected | Logits/chosen | Logits/rejected |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6888 | 0.0806 | 50 | 0.6891 | -0.0097 | -0.0183 | 0.5800 | 0.0086 | -32.0277 | -36.8175 | 0.4651 | 0.4536 |
| 0.6637 | 0.1612 | 100 | 0.6643 | -0.1036 | -0.1689 | 0.6900 | 0.0653 | -32.9663 | -38.3233 | 0.4729 | 0.4553 |
| 0.6275 | 0.2418 | 150 | 0.6287 | -0.2312 | -0.3882 | 0.6650 | 0.1570 | -34.2426 | -40.5160 | 0.4366 | 0.4244 |
| 0.5805 | 0.3225 | 200 | 0.5982 | -0.3261 | -0.5805 | 0.7050 | 0.2545 | -35.1910 | -42.4393 | 0.3993 | 0.3911 |
| 0.5132 | 0.4031 | 250 | 0.5752 | -0.3879 | -0.7462 | 0.7050 | 0.3583 | -35.8094 | -44.0962 | 0.3547 | 0.3346 |
| 0.5218 | 0.4837 | 300 | 0.5598 | -0.3934 | -0.8424 | 0.7250 | 0.4490 | -35.8645 | -45.0584 | 0.3146 | 0.2951 |
| 0.449 | 0.5643 | 350 | 0.5505 | -0.4804 | -1.0050 | 0.7250 | 0.5246 | -36.7344 | -46.6842 | 0.2733 | 0.2540 |
| 0.4075 | 0.6449 | 400 | 0.5391 | -0.4772 | -1.0612 | 0.7150 | 0.5840 | -36.7021 | -47.2460 | 0.2404 | 0.2324 |
| 0.5689 | 0.7255 | 450 | 0.5325 | -0.5299 | -1.1545 | 0.7150 | 0.6246 | -37.2289 | -48.1790 | 0.2281 | 0.2155 |
| 0.4456 | 0.8061 | 500 | 0.5280 | -0.5577 | -1.1977 | 0.7200 | 0.6400 | -37.5073 | -48.6110 | 0.2110 | 0.1989 |
| 0.5101 | 0.8867 | 550 | 0.5262 | -0.5693 | -1.2199 | 0.7300 | 0.6505 | -37.6238 | -48.8327 | 0.2072 | 0.1892 |
| 0.4293 | 0.9674 | 600 | 0.5246 | -0.5665 | -1.2212 | 0.7300 | 0.6547 | -37.5951 | -48.8461 | 0.2064 | 0.1921 |

Framework versions

  • PEFT 0.17.1
  • Transformers 4.49.0
  • Pytorch 2.5.1+cu124
  • Datasets 4.0.0
  • Tokenizers 0.21.0
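
Since this repository ships a PEFT LoRA adapter rather than full model weights, inference requires loading the base model first and attaching the adapter on top. A minimal sketch, assuming the public Hub checkpoint Qwen/Qwen2.5-VL-7B-Instruct matches the local base model used for training:

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import PeftModel

# Load the base model (assumption: the Hub checkpoint matches the local
# training copy) and apply this repository's LoRA adapter on top of it.
base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
model = PeftModel.from_pretrained(
    base, "xiaorui638/qwen2_5vl7b-dpo_40k_abla_all_eight-lora"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
```

Calling `model.merge_and_unload()` afterwards folds the adapter into the base weights, removing the adapter indirection at inference time.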