See axolotl config
axolotl version: 0.13.0.dev0
base_model: Qwen/Qwen2.5-72B
# Alpaca finetuning configuration for Qwen2.5-72B
# hub_model_id: username/custom_model_name
load_in_8bit: false
load_in_4bit: false
strict: false
plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_layer_norm: true
liger_fused_linear_cross_entropy: true
dataset_prepared_path: /home/ubuntu/cluster-axolotl/alpaca_prepared_qwen72b
seed: 42
chat_template: qwen_25
datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
output_dir: ./outputs/qat_out_qwen72b/
sequence_len: 8096
sample_packing: true
flash_attention: true
qat:
  activation_dtype: nvfp4
  weight_dtype: nvfp4
  group_size: 16 # only group_size of 16 is supported with nvfp4
wandb_project: qat_b200
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
gradient_accumulation_steps: 1
micro_batch_size: 16
num_epochs: 1
optimizer: adamw_torch_fused
lr_scheduler: cosine
learning_rate: 2e-6
bf16: true
tf32: true
resume_from_checkpoint:
logging_steps: 1
save_strategy: "no"
# evals_per_epoch: 1
# saves_per_epoch: 1
warmup_ratio: 0.1
weight_decay: 0.0
fsdp_version: 2
fsdp_config:
  offload_params: false
  cpu_ram_efficient_loading: true
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: Qwen2DecoderLayer
  state_dict_type: FULL_STATE_DICT
  sharding_strategy: FULL_SHARD
  reshard_after_forward: true
  activation_checkpointing: true
special_tokens:
# save_first_step: true # uncomment this to validate checkpoint saving works with your config
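The `qat` block in this config enables NVFP4 quantization-aware training: during the forward pass, weights and activations are fake-quantized to 4-bit floating point (E2M1) with one shared scale per group of 16 values, which is why only `group_size: 16` is accepted. The sketch below is a minimal numpy illustration of that group-wise fake quantization, assuming the standard E2M1 value grid; the function name and rounding details are illustrative only and are not axolotl's or torchao's actual code path, which also quantizes each group scale to FP8 (E4M3) and runs in fused kernels.

```python
import numpy as np

# Magnitudes representable by FP4 (E2M1); NVFP4 adds a sign bit and a per-block scale.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp4_groupwise(x: np.ndarray, group_size: int = 16) -> np.ndarray:
    """Quantize-dequantize `x` with one scale per `group_size` contiguous values.

    Illustrative only: real NVFP4 kernels also quantize each group scale to
    FP8 (E4M3) and keep the matmul in low precision; both are omitted here.
    """
    groups = x.reshape(-1, group_size)
    # Map each group's max magnitude onto the largest FP4 value (6.0).
    scale = np.abs(groups).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale = np.where(scale == 0.0, 1.0, scale)
    scaled = groups / scale
    # Snap every value to the nearest representable FP4 magnitude, keeping its sign.
    nearest = FP4_GRID[np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)]
    return (np.sign(scaled) * nearest * scale).reshape(x.shape)

w = np.random.randn(8, 64).astype(np.float32)
print("max abs quantization error:", np.abs(w - fake_quant_fp4_groupwise(w)).max())
```

Running this shows the round-trip error staying within each group's quantization step, which is the error profile the model learns to tolerate during QAT.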
home/ubuntu/cluster-axolotl/logs/Qwen72BAlpaca2e-6_qat
This model is a fine-tuned version of Qwen/Qwen2.5-72B on the tatsu-lab/alpaca dataset.
Model description
More information needed
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-06
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- distributed_type: multi-GPU
- num_devices: 16
- total_train_batch_size: 256
- total_eval_batch_size: 256
- optimizer: adamw_torch_fused with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: cosine
- num_epochs: 1.0
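The total batch sizes above follow directly from the per-device settings in the config (micro_batch_size × num_devices × gradient_accumulation_steps); a quick check of that arithmetic:

```python
# Sanity check: how the reported totals are derived from the config values.
micro_batch_size = 16              # per-GPU batch size (micro_batch_size in the config)
num_devices = 16                   # GPUs in the multi-GPU run
gradient_accumulation_steps = 1

total_train_batch_size = micro_batch_size * num_devices * gradient_accumulation_steps
assert total_train_batch_size == 256  # matches the value reported above
```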
Training results
Framework versions
- Transformers 4.56.1
- Pytorch 2.8.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.1