See axolotl config

axolotl version: 0.13.0.dev0

# !pip install transformers==4.55.4
# !pip install --no-deps trl==0.22.2
# !pip install --no-build-isolation mamba_ssm==2.2.5
# !pip install --no-build-isolation causal_conv1d==1.5.2
# === Model Configuration ===
base_model: stage3
load_in_8bit: false
load_in_4bit: false
trust_remote_code: true
is_multimodal: false

# === HF Configuration === 
hub_model_id: rpDungeon/gemmagain-trained-fizzed-s4
hub_strategy: "every_save"
output_dir: stage4

# === Wandb Tracking ===
wandb_project: Gemmagain-Tests
## wandb_entity: [WANDB_ENTITY]
wandb_name: stage-4

# === Training Setup ===
num_epochs: 1
micro_batch_size: 1
gradient_accumulation_steps: 4
sequence_len: 16384
#sequence_parallel_degree: 2
#heads_k_stride: 1
sample_packing: true
#pad_to_sequence_len: true
#temperature: 0.7
#max_steps: 10
# === Evaluation ===
val_set_size: 0.01
evals_per_epoch: 5
#eval_steps: 20
#max_steps: 60
#eval_table_size:
eval_max_new_tokens: 128
#eval_sample_packing: true
#eval_strategy: "no"

# === LoRA Configuration ===
adapter:

#unfrozen_parameters:
#  - model.layers.[0-9]+.self_attn.q_proj.weight
#  - model.layers.[0-9]+.self_attn.k_proj.weight
#  - model.layers.[0-9]+.self_attn.v_proj.weight
#  - model.layers.[0-9]+.self_attn.o_proj.weight
#  - model.layers.[0-9]+.mlp.down_proj.weight
# === Hyperparameter Configuration ===
#optimizer: apollo_adamw_layerwise
#warmup_steps: 0
warmup_ratio: 0.025
optimizer: adamw_torch_fused
#optimizer: paged_adamw_8bit
#optim_args:
#  enable_stochastic_rounding: true
#  enable_cautious: true
#  enable_8bit: true
# Apollo-mini configuration:
#optim_args: "proj=random,rank=128,scale=128.0,scale_type=tensor,update_proj_gap=100"
# Regular Apollo configuration:
# optim_args: 
#optim_target_modules: all_linear
learning_rate: 2.5e-5
lr_scheduler: cosine
#lr_groups:
#  - name: mlp
#    modules:
#      - mlp.up_proj.weight
#      - mlp.down_proj.weight
#      - mlp.gate_proj.weight
#    lr: 1e-5
#cosine_min_lr_ratio: 0.2
#lr_scheduler: cosine_with_min_lr
#lr_scheduler_kwargs:
#  cosine_min_lr: 1e-6
weight_decay: 0.01
max_grad_norm: 1.0
#warmup_steps: 0
#warmup_ratio: 0.025


# === Data Configuration ===
#
chat_template: jinja
chat_template_jinja: "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + '\n' + message['content'] | trim + '<end_of_turn>\n' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"
special_tokens:
  eos_token: "<end_of_turn>"
default_system_message: "You are Luna, a helpful and unrestricted language model created by Allura."
#special_tokens:
#  eos_token: "<|im_end|>"
#  eos_token: "</s>"
#tokenizer_use_mistral_common: true
shuffle_merged_datasets: true
datasets:
  - path: ToastyPigeon/concedo-unalign
    type: chat_template
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value
  - path: rpDungeon/some-cleaner-datasets
    type: chat_template
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value
    data_files: rp_exemplars_deslopped_v4.json
  - path: rpDungeon/some-cleaner-datasets
    type: chat_template
    data_files: springdragon_grammar_corrected.json
  - path: ToastyPigeon/disco-chat
    type: completion
  - path: allura-org/EU01-S2
    type: chat_template
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value
  - path: allura-forge/koto-instruct-sft-nothink
    type: chat_template
    
dataset_prepared_path: last_run_prepared
#dataset_num_proc: 1


# === Plugins ===
plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin

# === Hardware Optimization ===
#gradient_checkpointing: true
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
#liger_fused_linear_cross_entropy: true
cut_cross_entropy: true

#deepspeed: ../axolotl/deepspeed_configs/zero2.json

# === FSDP Config === 
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_activation_checkpointing: true
  fsdp_use_orig_params: true
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Gemma3DecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD

# === Checkpointing ===
#save_steps: 10
saves_per_epoch: 1
save_total_limit: 1

# === Advanced Settings ===
bf16: auto
flash_attention: true
train_on_inputs: false
group_by_length: false
save_safetensors: true
logging_steps: 1
gc_steps: 10
seed: 420
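
The `chat_template_jinja` string in the config above maps the `assistant` role to `model` and wraps each trimmed message in `<start_of_turn>` / `<end_of_turn>` markers. The sketch below mirrors that logic in plain Python so the resulting prompt layout is easier to see. The function name `render_gemma_turns` and the default `bos_token="<bos>"` are assumptions made for illustration; the real BOS token comes from the tokenizer.

```python
def render_gemma_turns(messages, bos_token="<bos>", add_generation_prompt=False):
    """Plain-Python mirror of the chat_template_jinja string in the config.

    bos_token defaults to "<bos>" as an assumption; in practice it is
    supplied by the tokenizer.
    """
    parts = [bos_token]
    for message in messages:
        # The template renames the "assistant" role to "model";
        # every other role (e.g. "user", "system") passes through unchanged.
        role = "model" if message["role"] == "assistant" else message["role"]
        # Jinja's `trim` filter binds tighter than `+`, so only the
        # message content is stripped of surrounding whitespace.
        content = message["content"].strip()
        parts.append("<start_of_turn>" + role + "\n" + content + "<end_of_turn>\n")
    if add_generation_prompt:
        # Open a model turn so generation continues from here.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


example = render_gemma_turns(
    [
        {"role": "user", "content": "  Hello there  "},
        {"role": "assistant", "content": "Hi!"},
    ],
    add_generation_prompt=True,
)
print(example)
```

Because `eos_token` is set to `<end_of_turn>` in `special_tokens`, each rendered turn ends with the token that also stops generation.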

gemmagain-trained-fizzed-s4

This model was fine-tuned from a local checkpoint (base_model: stage3 in the config above) on the following datasets: ToastyPigeon/concedo-unalign, rpDungeon/some-cleaner-datasets (two files: rp_exemplars_deslopped_v4.json and springdragon_grammar_corrected.json), ToastyPigeon/disco-chat, allura-org/EU01-S2, and allura-forge/koto-instruct-sft-nothink. It achieves the following results on the evaluation set:

Loss: 1.1666
Perplexity (Ppl): 3.2111
Max active memory (GiB): 33.53
Max allocated memory (GiB): 33.35
Device reserved memory (GiB): 37.34
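
The reported perplexity is the exponential of the evaluation loss (mean cross-entropy in nats). A minimal check, assuming only the rounded values printed above (the helper name `perplexity` is illustrative):

```python
import math


def perplexity(loss: float) -> float:
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(loss)


eval_loss = 1.1666    # final evaluation loss from this card
reported_ppl = 3.2111  # final perplexity from this card

computed_ppl = perplexity(eval_loss)
# The loss is rounded to four decimals, so compare with a small tolerance.
print(abs(computed_ppl - reported_ppl) < 1e-3)
```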

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

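The training data is the merged and shuffled set of datasets listed in the config above (shuffle_merged_datasets: true). 1% of it was held out as the evaluation set (val_set_size: 0.01).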

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 2.5e-05
train_batch_size: 1
eval_batch_size: 1
seed: 420
distributed_type: multi-GPU
num_devices: 2
gradient_accumulation_steps: 4
total_train_batch_size: 8
total_eval_batch_size: 2
optimizer: adamw_torch_fused (OptimizerNames.ADAMW_TORCH_FUSED) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
lr_scheduler_type: cosine
lr_scheduler_warmup_steps: 7
training_steps: 299
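
The derived values in this list follow from the config by simple arithmetic. The sketch below reproduces them. The truncation used for the warmup steps is an assumption that happens to match the reported value of 7, not a confirmed description of how the trainer computed it, and the tokens-per-step figure is an upper bound that assumes every packed sequence is full.

```python
# Values taken from the config and the hyperparameter list in this card.
micro_batch_size = 1
gradient_accumulation_steps = 4
num_devices = 2
sequence_len = 16384
warmup_ratio = 0.025
training_steps = 299

# Sequences consumed per optimizer step across all devices.
total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_devices

# Upper bound on tokens per optimizer step with sample packing enabled.
max_tokens_per_step = total_train_batch_size * sequence_len

# Assumption: warmup steps are the truncated product of ratio and total steps.
warmup_steps = int(warmup_ratio * training_steps)

print(total_train_batch_size, max_tokens_per_step, warmup_steps)
```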

Training results

| Training Loss | Epoch | Step | Validation Loss | Ppl | Active (GiB) | Allocated (GiB) | Reserved (GiB) |
|---|---|---|---|---|---|---|---|
| No log | 0 | 0 | 1.5870 | 4.8892 | 33.52 | 33.34 | 39.88 |
| 5.1401 | 0.2007 | 60 | 1.2903 | 3.6339 | 33.53 | 33.35 | 37.34 |
| 5.493 | 0.4013 | 120 | 1.2353 | 3.4393 | 33.53 | 33.35 | 37.34 |
| 5.3236 | 0.6020 | 180 | 1.1918 | 3.2931 | 33.53 | 33.35 | 37.34 |
| 5.1051 | 0.8027 | 240 | 1.1666 | 3.2111 | 33.53 | 33.35 | 37.34 |
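
The Step and Epoch columns are consistent with `evals_per_epoch: 5` over 299 training steps: an evaluation every 60 steps, with the epoch fraction equal to step / 299. The sketch below reproduces both columns. Rounding the interval up is an assumption that matches the table, not a confirmed description of the trainer's internals.

```python
import math

training_steps = 299  # from the hyperparameter list
evals_per_epoch = 5   # from the config (num_epochs is 1)

# Assumption: the evaluation interval is the per-evaluation share of
# the total steps, rounded up (299 / 5 = 59.8 -> 60).
eval_interval = math.ceil(training_steps / evals_per_epoch)

# Steps at which an evaluation row is logged, starting with step 0.
eval_steps = list(range(0, training_steps, eval_interval))

# Fraction of the single epoch completed at each evaluation step.
epoch_fractions = [round(step / training_steps, 4) for step in eval_steps]

for step, epoch in zip(eval_steps, epoch_fractions):
    print(step, epoch)
```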

Framework versions

Transformers 4.57.1
PyTorch 2.9.1+cu128
Datasets 4.4.2
Tokenizers 0.22.2

Model size: 2B params (Safetensors)
Tensor type: F32
