This model was trained as part of the Reinforcement Learning - 24 project at Peking University, focusing on SimPO (Simple Preference Optimization).
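SimPO optimizes preference pairs directly from the policy's own length-normalized log-likelihoods, with no frozen reference model. Below is a minimal sketch of the objective, assuming the published formulation; the `beta` and `gamma` values are illustrative defaults, not the values used to train this checkpoint.

```python
# Minimal sketch of the SimPO objective (Meng et al., 2024).
# beta and gamma below are illustrative, NOT this checkpoint's training values.
import torch
import torch.nn.functional as F

def simpo_loss(
    chosen_logps: torch.Tensor,      # summed token log-probs of the chosen responses
    rejected_logps: torch.Tensor,    # summed token log-probs of the rejected responses
    chosen_lengths: torch.Tensor,    # token counts of the chosen responses
    rejected_lengths: torch.Tensor,  # token counts of the rejected responses
    beta: float = 2.5,               # reward scale (assumed)
    gamma: float = 1.4,              # target reward margin (assumed)
) -> torch.Tensor:
    # SimPO's implicit reward is the length-normalized log-likelihood,
    # so no reference model is needed (unlike DPO).
    chosen_rewards = beta * chosen_logps / chosen_lengths
    rejected_rewards = beta * rejected_logps / rejected_lengths
    # Push the chosen reward above the rejected reward by at least gamma.
    return -F.logsigmoid(chosen_rewards - rejected_rewards - gamma).mean()
```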
This model is a fine-tuned version of Qwen/Qwen2-1.5B-Instruct on the princeton-nlp/llama3-ultrafeedback dataset. Evaluation results are reported in the training results table below.
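The checkpoint keeps the Qwen2 chat format of its base model, so it can presumably be loaded with the standard transformers API. A minimal usage sketch follows; the model id is a placeholder for this repository's actual name, and the prompt and generation settings are illustrative.

```python
# Minimal usage sketch with transformers; the model id is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-username/Qwen2-1.5B-Instruct-SimPO"  # placeholder, not the real repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"  # device_map="auto" needs accelerate
)

messages = [{"role": "user", "content": "Explain preference optimization in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```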
The following results were logged on the evaluation set during training:
| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.6402 | 0.8549 | 400 | 1.6353 | -2.6155 | -2.7990 | 0.5726 | 0.1835 | -1.1196 | -1.0462 | -1.5085 | -1.4841 |
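Here Rewards/margins is the gap between the chosen and rejected rewards (-2.6155 - (-2.7990) = 0.1835), and Rewards/accuracies is the fraction of preference pairs for which the chosen response receives the higher reward.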