KL3M 500M Conservative, Step 100000
A 500M parameter language model trained with conservative hyperparameters using the Muon optimizer and spectral clamping. This checkpoint represents 100,000 training steps (8.32B tokens).
Model Details
- Architecture: Llama-based with Grouped Query Attention (GQA)
- Parameters: 500.3M
- Layers: 120
- Training Steps: 100,000
- Tokens Processed: 8.32B
- Precision: BF16
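Since the card states the model is Llama-based with GQA, the attention layout can be checked directly from the published config. The snippet below assumes the standard Hugging Face `LlamaConfig` attribute names (`num_hidden_layers`, `num_attention_heads`, `num_key_value_heads`); that naming is an assumption about this checkpoint's config class, not something stated above.

```python
from transformers import AutoConfig

# Read the checkpoint's configuration from the Hub and inspect the GQA layout.
config = AutoConfig.from_pretrained("alea-institute/kl3m-007-500m-step100000")
print(config.num_hidden_layers)     # transformer depth
print(config.num_attention_heads)   # number of query heads
print(config.num_key_value_heads)   # number of KV heads (< query heads under GQA)
```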
Training Configuration
Optimizer: Muon
- Muon LR: 0.000073 (depth-scaled)
- Aux LR: 0.00005
- Momentum: 0.95
Spectral Clamping
- Frequency: Every 100 steps
- Max condition number: 2,500
- Applied to: attention, MLP, and lm_head layers
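The clamping routine itself is not included in this card. As a rough illustration, the sketch below bounds the condition number of a single weight matrix by raising its smallest singular values; `clamp_condition_number` is a hypothetical helper, and the actual training code may clamp differently.

```python
import torch

@torch.no_grad()
def clamp_condition_number(weight: torch.Tensor, max_condition: float = 2500.0) -> torch.Tensor:
    """Clamp singular values so that sigma_max / sigma_min <= max_condition.

    Illustrative only; the project's real procedure and layer selection may differ.
    """
    # SVD in float32 for numerical stability, since the weights are stored in BF16.
    u, s, vh = torch.linalg.svd(weight.float(), full_matrices=False)
    floor = s.max() / max_condition    # smallest singular value allowed
    s_clamped = s.clamp(min=floor)     # raise tiny singular values up to the floor
    return (u @ torch.diag(s_clamped) @ vh).to(weight.dtype)
```

Per the settings above, a clamp like this would be applied every 100 steps to the attention, MLP, and lm_head weight matrices.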
Regularization
- Label Smoothing: 0.01
- Entropy Bonus: 0.005
- Activation Norm: 0.0005
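How these three terms enter the training objective is not spelled out here. The following sketch shows one plausible combination using the listed coefficients; the signs, the normalization, and the choice of penalizing the final hidden states are all assumptions, not details from the training code.

```python
import torch
import torch.nn.functional as F

def regularized_loss(logits, labels, hidden_states,
                     label_smoothing=0.01, entropy_bonus=0.005, act_norm=0.0005):
    """Illustrative loss combining the three regularizers listed above.

    Coefficients match the card; the exact combination is an assumption.
    """
    # Cross-entropy with label smoothing (built into PyTorch).
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1),
                         label_smoothing=label_smoothing)
    # Entropy bonus: reward higher predictive entropy by subtracting it from the loss.
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1).mean()
    # Activation-norm penalty on the hidden states.
    act_penalty = hidden_states.pow(2).mean()
    return ce - entropy_bonus * entropy + act_norm * act_penalty
```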
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the checkpoint and tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained(
    "alea-institute/kl3m-007-500m-step100000",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-007-500m-step100000")

# Tokenize a prompt and move the tensors to the model's device.
inputs = tokenizer("The contract specifies", return_tensors="pt", return_token_type_ids=False)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Sample a continuation; do_sample=True is required for temperature/top_p to take effect.
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
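As an alternative, the standard transformers text-generation pipeline can drive the same checkpoint; nothing in this snippet is specific to KL3M beyond the model id.

```python
from transformers import pipeline

# Build a text-generation pipeline around the same checkpoint.
generator = pipeline(
    "text-generation",
    model="alea-institute/kl3m-007-500m-step100000",
    torch_dtype="auto",
    device_map="auto",
)
result = generator("The contract specifies", max_new_tokens=100,
                   do_sample=True, temperature=0.8, top_p=0.95)
print(result[0]["generated_text"])
```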
License
Apache 2.0