# GRPO training arguments. `vllm_sampling_params`, `max_prompt_length`, and
# `max_completion_length` are assumed to be defined earlier in the script.
from trl import GRPOConfig

training_args = GRPOConfig(
    vllm_sampling_params = vllm_sampling_params,  # sampling settings forwarded to the vLLM generation backend
    # max_grad_norm = 0.1,
    # beta = 0.001,
    temperature = 1.0,                # sampling temperature for generated completions
    learning_rate = 1e-5,
    weight_decay = 0.01,
    warmup_ratio = 0.01,
    lr_scheduler_type = "linear",
    optim = "adamw_8bit",             # 8-bit AdamW to reduce optimizer memory
    logging_steps = 1,
    per_device_train_batch_size = 12,
    gradient_accumulation_steps = 1,
    num_generations = 4,              # completions sampled per prompt for the GRPO group
    # steps_per_generation = 16,
    max_prompt_length = max_prompt_length,
    max_completion_length = max_completion_length,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 1600,
    save_steps = 250,                 # checkpoint every 250 steps
    save_total_limit = 10,
    report_to = "wandb",              # log metrics to Weights & Biases
    output_dir = "outputs",
)
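
These arguments are consumed by TRL's GRPOTrainer. Below is a minimal wiring sketch, assuming a model, tokenizer, prompt dataset, and reward functions (the placeholder names `model`, `tokenizer`, `train_dataset`, and `reward_funcs`) have been prepared earlier in the script:

from trl import GRPOTrainer

# Sketch only: model, tokenizer, reward_funcs, and train_dataset are
# placeholders assumed to be defined beforehand.
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = reward_funcs,      # reward function(s) used to score completions
    args = training_args,             # the GRPOConfig defined above
    train_dataset = train_dataset,
)
trainer.train()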