Mini GLM-4 MoE (0.5B)

A small GLM-4 MoE model (543M parameters) for testing and development. Uses the same Glm4MoeForCausalLM architecture as the full GLM-4-100B-A10B but with reduced dimensions.

This model is designed for testing MoE training pipelines in prime-rl without needing large pretrained checkpoints. It is small enough to run on a single GPU while exercising the same code paths as production models.

Architecture

| Parameter | Value |
|---|---|
| Parameters | 543M |
| Hidden size | 1024 |
| Layers | 24 |
| Attention heads | 16 (4 KV heads) |
| Routed experts | 8 |
| Experts per token | 4 |
| Shared experts | 1 |
| MoE intermediate size | 256 |
| Dense intermediate size | 2048 |
| Dense layers (first-k) | 1 |
| Vocab size | 151,552 |
| Partial rotary factor | 0.5 |
| Model type | glm4_moe |

The architecture mirrors THUDM/GLM-4-100B-A10B: the first layer is a dense MLP, and all subsequent layers use Mixture-of-Experts with a shared expert. Attention uses Grouped Query Attention (GQA) with partial rotary embeddings.
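
As a rough guide, the table above maps onto a transformers Glm4MoeConfig along these lines. The field names are assumed from the glm4_moe config in recent transformers releases and are worth checking against your installed version:

```python
# Sketch: the table above expressed as a Glm4MoeConfig.
# Field names are assumed from the glm4_moe config in recent transformers
# releases and may differ in your installed version.
from transformers import Glm4MoeConfig

config = Glm4MoeConfig(
    vocab_size=151_552,
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
    num_key_value_heads=4,       # GQA: 4 KV heads shared across 16 query heads
    intermediate_size=2048,      # dense MLP size (first layer)
    moe_intermediate_size=256,   # per-expert MLP size
    n_routed_experts=8,
    num_experts_per_tok=4,
    n_shared_experts=1,
    first_k_dense_replace=1,     # first layer dense, remaining layers MoE
    partial_rotary_factor=0.5,   # rotary embeddings on half of each head dim
)
```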

How this model was created

Step 1: Random initialization. A Glm4MoeConfig was instantiated with the small dimensions above and the HuggingFace Glm4MoeForCausalLM model was initialized with random weights. The tokenizer was copied from THUDM/GLM-4-9B-0414.
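
A minimal sketch of this step, assuming the config object from the sketch above and standard transformers APIs; the actual scripts/mini_moe/create.py may do this differently:

```python
# Sketch of Step 1: random-weight init plus a copied tokenizer.
# `config` is the Glm4MoeConfig sketched above; the real
# scripts/mini_moe/create.py may differ.
from transformers import AutoTokenizer, Glm4MoeForCausalLM

model = Glm4MoeForCausalLM(config)  # random init, no pretrained checkpoint
tokenizer = AutoTokenizer.from_pretrained("THUDM/GLM-4-9B-0414")

model.save_pretrained("./mini-glm-moe")
tokenizer.save_pretrained("./mini-glm-moe")

print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```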

Step 2: Roundtrip verification. Before training, we verified that the HuggingFace and prime-rl custom implementations produce matching outputs on the same weights (maximum absolute logit difference < 0.01), and that the convert_to_hf / convert_to_prime state dict conversions are lossless.
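
Conceptually, the logits check reduces to comparing the two implementations' outputs on identical inputs. A generic sketch of that check, not the actual verify.py (the convert helpers are omitted):

```python
# Generic logits-equivalence check between two implementations sharing the
# same weights; not the actual scripts/mini_moe/verify.py.
import torch

@torch.no_grad()
def max_logits_diff(model_a, model_b, tokenizer, prompt: str = "hello world") -> float:
    inputs = tokenizer(prompt, return_tensors="pt")
    logits_a = model_a(**inputs).logits
    logits_b = model_b(**inputs).logits
    return (logits_a - logits_b).abs().max().item()

# e.g. assert max_logits_diff(hf_model, prime_model, tokenizer) < 0.01
```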

Step 3: SFT warmup. The model was fine-tuned for 200 steps on PrimeIntellect/Reverse-Text-SFT using prime-rl's custom MoE implementation with the following config:

```toml
max_steps = 200

[model]
impl = "custom"
attn = "sdpa"

[data]
name = "PrimeIntellect/Reverse-Text-SFT"
batch_size = 4
seq_len = 1024

[optim]
lr = 1e-4
```

Training loss dropped from ~12 (random init) to ~2.5 over the 200 steps. The model is not meant to be useful for generation; the SFT warmup simply gives it a non-trivial learned distribution so that KL divergence and other RL metrics are meaningful during testing.
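
For context, the kind of KL metric this enables can be computed per token from policy and reference logits. A generic sketch, not prime-rl's metric code:

```python
# Per-token KL(policy || reference) from raw logits; the kind of metric the
# SFT warmup makes non-trivial. A generic sketch, not prime-rl's metric code.
import torch
import torch.nn.functional as F

def per_token_kl(policy_logits: torch.Tensor, ref_logits: torch.Tensor) -> torch.Tensor:
    """[batch, seq, vocab] logits -> per-token KL of shape [batch, seq]."""
    policy_logp = F.log_softmax(policy_logits.float(), dim=-1)
    ref_logp = F.log_softmax(ref_logits.float(), dim=-1)
    return (policy_logp.exp() * (policy_logp - ref_logp)).sum(dim=-1)
```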

Step 4: Post-training verification. After SFT, we re-verified the HF <-> PrimeRL roundtrip on the trained checkpoint to confirm that checkpoint saving (which goes through convert_to_hf) produced valid weights.

Reproduction

The scripts used to create this model live in the prime-rl repository under scripts/mini_moe/:

```bash
# Step 1: Create random-init model
uv run python scripts/mini_moe/create.py --arch glm4_moe --output-dir ./mini-glm-moe

# Step 2: Verify HF <-> PrimeRL roundtrip
uv run python scripts/mini_moe/verify.py --arch glm4_moe --model-dir ./mini-glm-moe

# Step 3: SFT warmup + verify + push
uv run python scripts/mini_moe/sft_warmup.py --arch glm4_moe --model-dir ./mini-glm-moe --sft-steps 200 --push-to-hub samsja/mini-glm-moe
```

To add a new architecture, add a preset to scripts/mini_moe/presets.py.

Intended use

  • Testing MoE training pipelines (SFT, RL) in prime-rl
  • Validating state dict conversion between HuggingFace and prime-rl formats
  • Integration tests that need a real MoE model but cannot afford large checkpoints
  • Checking RL metrics (KL divergence, reward signals) on a small scale

This model is not intended for inference or any downstream task.
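
As an illustration of test usage, a smoke test can load the checkpoint and run a single forward pass. This assumes a transformers version that includes the glm4_moe architecture and uses the hub id from the push command above:

```python
# Minimal smoke test: load the checkpoint and run a single forward pass.
# Assumes the hub id from the push command above and a transformers version
# that includes the glm4_moe architecture.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("samsja/mini-glm-moe")
model = AutoModelForCausalLM.from_pretrained("samsja/mini-glm-moe")

inputs = tokenizer("reverse this text", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

assert logits.shape[-1] == model.config.vocab_size
```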
