KL3M 500M Conservative, Step 100000

A 500M parameter language model trained with conservative hyperparameters using the Muon optimizer and spectral clamping. This checkpoint represents 100,000 training steps (8.32B tokens).

Model Details

  • Architecture: Llama-based with Grouped Query Attention (GQA)
  • Parameters: 500.3M
  • Layers: 120
  • Training Steps: 100,000
  • Tokens Processed: 8.32B
  • Precision: BF16
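The GQA layout can be confirmed directly from the checkpoint's configuration. The sketch below uses the standard transformers config API; the printed attribute names are the usual Llama config fields, and the values are not taken from this card.

from transformers import AutoConfig

# Load the configuration only; no model weights are downloaded.
config = AutoConfig.from_pretrained("alea-institute/kl3m-007-500m-step100000")

# With GQA, num_key_value_heads is smaller than num_attention_heads,
# so several query heads share each key/value head.
print("hidden layers:  ", config.num_hidden_layers)
print("attention heads:", config.num_attention_heads)
print("key/value heads:", config.num_key_value_heads)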

Training Configuration

Optimizer: Muon

  • Muon LR: 0.000073 (depth-scaled)
  • Aux LR: 0.00005
  • Momentum: 0.95
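Muon is typically applied only to the 2-D hidden weight matrices, with embeddings, norms, and other parameters handled by an auxiliary optimizer, which is what the separate "Aux LR" suggests. Below is a minimal sketch of that split using the hyperparameters listed above; the Muon import and constructor signature are hypothetical, and the exact parameter grouping used for KL3M is an assumption.

import torch
from muon import Muon  # hypothetical import; the real package and signature may differ

def build_optimizers(model):
    # Muon group: 2-D hidden weight matrices.
    # Aux group: embeddings, lm_head, norms, biases, and any 1-D parameters.
    muon_params = [p for n, p in model.named_parameters()
                   if p.ndim == 2 and "embed" not in n and "lm_head" not in n]
    aux_params = [p for n, p in model.named_parameters()
                  if p.ndim != 2 or "embed" in n or "lm_head" in n]

    muon_opt = Muon(muon_params, lr=7.3e-5, momentum=0.95)  # Muon LR and momentum from the card
    aux_opt = torch.optim.AdamW(aux_params, lr=5e-5)        # Aux LR from the card
    return muon_opt, aux_opt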

Spectral Clamping

  • Frequency: Every 100 steps
  • Max condition: 2500
  • Applied to: attention, MLP, and lm_head layers
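A minimal sketch of what this might look like, assuming spectral clamping means raising small singular values so that the condition number (largest over smallest singular value) of each targeted weight matrix never exceeds 2,500; the actual KL3M implementation is not published on this card and may differ.

import torch

@torch.no_grad()
def spectral_clamp_(weight: torch.Tensor, max_condition: float = 2500.0) -> None:
    # Clamp singular values in place so that sigma_max / sigma_min <= max_condition.
    u, s, vh = torch.linalg.svd(weight.float(), full_matrices=False)
    floor = s.max() / max_condition
    weight.copy_((u * s.clamp(min=floor)) @ vh)

# Applied every 100 steps to attention, MLP, and lm_head weights, e.g.:
# if step % 100 == 0:
#     for name, param in model.named_parameters():
#         if param.ndim == 2 and any(k in name for k in ("attn", "mlp", "lm_head")):
#             spectral_clamp_(param)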

Regularization

  • Label Smoothing: 0.01
  • Entropy Bonus: 0.005
  • Activation Norm: 0.0005
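One plausible way these terms combine into a single training loss is sketched below: cross-entropy with label smoothing, an entropy bonus that rewards higher predictive entropy, and a small penalty on hidden-activation magnitudes. The exact formulation used for KL3M is not documented here, so the signs and reductions are assumptions.

import torch
import torch.nn.functional as F

def regularized_loss(logits, labels, hidden_states):
    # Cross-entropy with label smoothing (0.01 from the card).
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1),
                         label_smoothing=0.01)

    # Entropy bonus (0.005): subtracting mean predictive entropy discourages overconfidence.
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()

    # Activation-norm penalty (0.0005): keeps hidden-state magnitudes small.
    act_penalty = hidden_states.pow(2).mean()

    return ce - 0.005 * entropy + 0.0005 * act_penalty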

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the checkpoint; device_map="auto" requires the accelerate package.
model = AutoModelForCausalLM.from_pretrained(
    "alea-institute/kl3m-007-500m-step100000",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-007-500m-step100000")

# Tokenize a prompt and move the tensors to the model's device.
inputs = tokenizer("The contract specifies", return_tensors="pt", return_token_type_ids=False)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Sample a continuation; do_sample=True is required for temperature/top_p to take effect.
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

License

Apache 2.0
