# Mini GLM-4 MoE (0.5B)
A small GLM-4 MoE model (543M parameters) for testing and development. Uses the same `Glm4MoeForCausalLM` architecture as the full GLM-4-100B-A10B but with reduced dimensions.
This model is designed for testing MoE training pipelines in prime-rl without needing large pretrained checkpoints. It is small enough to run on a single GPU while exercising the same code paths as production models.
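For a quick smoke test, the checkpoint loads like any other transformers causal LM. The snippet below assumes the `samsja/mini-glm-moe` repo id used as the push target in the reproduction steps, and a transformers version recent enough to include the `glm4_moe` architecture.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "samsja/mini-glm-moe"  # push-to-hub target from the reproduction steps below

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Outputs are not meaningful (the model only saw a toy SFT task);
# this is just a pipeline sanity check.
inputs = tokenizer("hello world", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # (1, num_tokens, 151552)
```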
## Architecture
| Parameter | Value |
|---|---|
| Parameters | 543M |
| Hidden size | 1024 |
| Layers | 24 |
| Attention heads | 16 (4 KV heads) |
| Routed experts | 8 |
| Experts per token | 4 |
| Shared experts | 1 |
| MoE intermediate size | 256 |
| Dense intermediate size | 2048 |
| Dense layers (first-k) | 1 |
| Vocab size | 151,552 |
| Partial rotary factor | 0.5 |
| Model type | glm4_moe |
The architecture mirrors THUDM/GLM-4-100B-A10B: the first layer is a dense MLP, and all subsequent layers use Mixture-of-Experts with a shared expert. Attention uses Grouped Query Attention (GQA) with partial rotary embeddings.
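As a sanity check, the 543M figure can be reproduced from the table alone. The rough count below assumes untied input/output embeddings, a head dimension of hidden_size / num_attention_heads = 64, and gate/up/down MLP projections; biases and norm weights are ignored as negligible.

```python
# Rough parameter count from the table above (biases and norms ignored).
hidden, layers, vocab = 1024, 24, 151_552
heads, kv_heads, head_dim = 16, 4, 64          # head_dim assumed to be hidden // heads
dense_inter, moe_inter = 2048, 256
routed, shared, dense_layers = 8, 1, 1

embed = 2 * vocab * hidden                                        # embedding + untied lm_head
attn = layers * hidden * head_dim * (2 * heads + 2 * kv_heads)    # q, o, k, v projections
dense_mlp = dense_layers * 3 * hidden * dense_inter               # gate / up / down
moe = (layers - dense_layers) * (
    (routed + shared) * 3 * hidden * moe_inter                    # routed + shared expert MLPs
    + hidden * routed                                             # router
)

print(f"{(embed + attn + dense_mlp + moe) / 1e6:.0f}M")  # ~543M
```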
## How this model was created
Step 1: Random initialization. A `Glm4MoeConfig` was instantiated with the small dimensions above, and the HuggingFace `Glm4MoeForCausalLM` model was initialized with random weights. The tokenizer was copied from THUDM/GLM-4-9B-0414.
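A minimal sketch of Step 1, assuming the `Glm4MoeConfig` field names follow the usual transformers MoE conventions (`n_routed_experts`, `moe_intermediate_size`, `first_k_dense_replace`, ...) and an explicit head dimension of 64; the actual logic lives in `scripts/mini_moe/create.py`.

```python
from transformers import AutoTokenizer, Glm4MoeConfig, Glm4MoeForCausalLM

config = Glm4MoeConfig(
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
    num_key_value_heads=4,
    head_dim=64,               # assumption: hidden_size // num_attention_heads
    n_routed_experts=8,
    num_experts_per_tok=4,
    n_shared_experts=1,
    moe_intermediate_size=256,
    intermediate_size=2048,
    first_k_dense_replace=1,
    vocab_size=151_552,
    partial_rotary_factor=0.5,
)

model = Glm4MoeForCausalLM(config)  # random weights
tokenizer = AutoTokenizer.from_pretrained("THUDM/GLM-4-9B-0414")

model.save_pretrained("./mini-glm-moe")
tokenizer.save_pretrained("./mini-glm-moe")
```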
Step 2: Roundtrip verification. Before training, we verified that the HuggingFace and prime-rl custom implementations produce matching outputs on the same weights (max absolute logits difference < 0.01), and that the `convert_to_hf` / `convert_to_prime` state dict conversions are lossless.
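The core of that check is a forward-pass diff. A minimal sketch (`hf_model` / `prime_model` below are placeholders for the two implementations loaded from the same weights, and the same call signature is assumed for both; the real check lives in `scripts/mini_moe/verify.py`):

```python
import torch

def max_logits_diff(model_a, model_b, vocab_size=151_552, batch=2, seq_len=128):
    """Run both implementations on the same random tokens and return the max abs logits gap."""
    input_ids = torch.randint(0, vocab_size, (batch, seq_len))
    with torch.no_grad():
        logits_a = model_a(input_ids).logits
        logits_b = model_b(input_ids).logits
    return (logits_a - logits_b).abs().max().item()

# e.g. assert max_logits_diff(hf_model, prime_model) < 0.01
```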
Step 3: SFT warmup. The model was fine-tuned for 200 steps on PrimeIntellect/Reverse-Text-SFT using prime-rl's custom MoE implementation with the following config:
```toml
max_steps = 200

[model]
impl = "custom"
attn = "sdpa"

[data]
name = "PrimeIntellect/Reverse-Text-SFT"
batch_size = 4
seq_len = 1024

[optim]
lr = 1e-4
```
Training loss dropped from ~12 (random init) to ~2.5 over the 200 steps. The model is not intended to be useful for generation; the SFT warmup simply gives it a non-trivial learned distribution so that KL divergence and other RL metrics are meaningful during testing.
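For reference, the KL metric in question is the standard per-token KL between the policy and a reference model, computed from logits; a minimal sketch (not prime-rl's exact formulation):

```python
import torch
import torch.nn.functional as F

def per_token_kl(policy_logits: torch.Tensor, ref_logits: torch.Tensor) -> torch.Tensor:
    """KL(policy || reference) per token, from logits of shape (batch, seq, vocab)."""
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    return (policy_logp.exp() * (policy_logp - ref_logp)).sum(dim=-1)  # (batch, seq)
```

On a purely random-init model the two distributions are near-uniform and this signal carries little information, which is what the short SFT warmup avoids.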
Step 4: Post-training verification. After SFT, we re-verified the HF <-> prime-rl roundtrip on the trained checkpoint to confirm that checkpoint saving (which goes through `convert_to_hf`) produced valid weights.
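The state dict side of that verification boils down to a roundtrip equality check. A sketch under the assumption that `convert_to_hf` / `convert_to_prime` map between the two state dict formats (their actual import paths and signatures are not shown here, so they are passed in as arguments):

```python
import torch

def assert_lossless_roundtrip(prime_state_dict, convert_to_hf, convert_to_prime):
    """Convert prime-rl -> HF -> prime-rl and check every tensor comes back unchanged."""
    roundtripped = convert_to_prime(convert_to_hf(prime_state_dict))
    assert prime_state_dict.keys() == roundtripped.keys(), "key sets differ"
    for name, tensor in prime_state_dict.items():
        assert torch.equal(tensor, roundtripped[name]), f"tensor mismatch: {name}"
```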
## Reproduction
The scripts used to create this model live in the prime-rl repository under `scripts/mini_moe/`:
```bash
# Step 1: Create random-init model
uv run python scripts/mini_moe/create.py --arch glm4_moe --output-dir ./mini-glm-moe

# Step 2: Verify HF <-> PrimeRL roundtrip
uv run python scripts/mini_moe/verify.py --arch glm4_moe --model-dir ./mini-glm-moe

# Step 3: SFT warmup + verify + push
uv run python scripts/mini_moe/sft_warmup.py --arch glm4_moe --model-dir ./mini-glm-moe --sft-steps 200 --push-to-hub samsja/mini-glm-moe
```
To add a new architecture, add a preset to `scripts/mini_moe/presets.py`.
## Intended use
- Testing MoE training pipelines (SFT, RL) in prime-rl
- Validating state dict conversion between HuggingFace and prime-rl formats
- Integration tests that need a real MoE model but cannot afford large checkpoints
- Checking RL metrics (KL divergence, reward signals) on a small scale
This model is not intended for inference or any downstream task.