# Kimi-K2.5-PRISM-REAP-530B-A32B
50% REAP expert-pruned version of moonshotai/Kimi-K2.5, built from the PRISM variant.
| Property | Value |
|---|---|
| Architecture | KimiK25 (DeepSeekV3 backbone) |
| Total Parameters | ~530B (down from ~1T) |
| Active Parameters | ~32B (8 experts per token) |
| Experts per MoE Layer | 192 routed + 1 shared (down from 384 + 1) |
| MoE Layers | 60 (layers 1-60, layer 0 is dense) |
| Quantization | INT4 (group_size=32, symmetric) via compressed-tensors |
| Disk Size | 289 GB (down from 555 GB) |
| Pruning Method | REAP (Router-weighted Expert Activation Pruning) |
| Calibration | 512 samples from allenai/tulu-3-sft-mixture, max 2800 tokens |
## What is REAP?
REAP (Cerebras Research, 2025) is a one-shot expert pruning method for Mixture-of-Experts models. It computes saliency scores using the router-weighted expert output norms from real forward passes:
$$
S_j = \frac{1}{|X_j|} \sum_{x \in X_j} g_j(x)\,\lVert f_j(x) \rVert_2
$$
Where g_j(x) is the normalized gate weight and ||f_j(x)||_2 is the L2 norm of expert j's output for token x. Experts with the lowest saliency are pruned.
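The score above can be sketched in a few lines of PyTorch. This is an illustrative reimplementation, not the repo's `observer.py`; the tensor shapes are assumptions (dense per-expert layout, with zero gate weight marking tokens outside `X_j`):

```python
import torch

def reap_saliency(gate_weights: torch.Tensor, expert_outputs: torch.Tensor) -> torch.Tensor:
    """S_j = (1/|X_j|) * sum_{x in X_j} g_j(x) * ||f_j(x)||_2.

    gate_weights:   (tokens, experts) normalized router weights; zero when a
                    token is not routed to that expert.
    expert_outputs: (tokens, experts, hidden) expert outputs f_j(x).
    Returns a (experts,) saliency vector S_j.
    """
    norms = expert_outputs.norm(dim=-1)        # ||f_j(x)||_2 per (token, expert)
    routed = gate_weights > 0                  # membership indicator for X_j
    counts = routed.sum(dim=0).clamp(min=1)    # |X_j|, guarded against empty sets
    return (gate_weights * norms).sum(dim=0) / counts
```

Experts with the lowest `S_j` in each MoE layer are the pruning candidates.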
## What is PRISM?
This model was first processed with our PRISM-LITE pipeline, which softens over-refusal and bias behaviors while preserving model quality. REAP pruning was then applied on top of the PRISM model.
## Key Technical Details
- Uniform 50% pruning: Every MoE layer pruned from 384 to 192 experts
- Super expert preservation: Top 0.5th percentile experts (by activation norm) were guaranteed to survive
- Zero-redundancy observer: Saliency computed from real forward pass hooks
- torch.compile fused INT4 GEMM: Custom compiled kernel for fast INT4 decompression during calibration
- Correct saliency ordering verified: in every layer, `min_retained_saliency > max_pruned_saliency`
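The selection rule with super-expert preservation can be sketched as follows. This is a hypothetical helper for illustration, not code from `reap/src/kimi_reap.py`; it assumes saliency and activation-norm vectors are available per layer:

```python
import torch

def select_retained(saliency: torch.Tensor, act_norms: torch.Tensor, keep: int) -> list[int]:
    """Keep the `keep` highest-saliency experts in a layer, but guarantee that
    'super experts' (top 0.5th percentile by activation norm) always survive."""
    # Super experts identified by activation norm, per the rule above
    threshold = torch.quantile(act_norms, 0.995)
    super_ids = set(torch.nonzero(act_norms >= threshold).flatten().tolist())
    # Rank all experts by saliency, descending
    order = torch.argsort(saliency, descending=True).tolist()
    # Supers first, then the best of the rest until `keep` slots are filled
    retained = [i for i in order if i in super_ids]
    retained += [i for i in order if i not in super_ids][: keep - len(retained)]
    return sorted(retained)
```

When the super experts also rank highest by saliency (as verified here), this reduces to a plain top-`keep` cut, which is why the `min_retained_saliency > max_pruned_saliency` check holds in every layer.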
## Hardware Requirements
This model is 289 GB in INT4 format. You need one of the following setups (or comparable aggregate VRAM):
| Setup | VRAM | Fits? |
|---|---|---|
| 8x H200 141GB | 1,128 GB | Yes (used for calibration) |
| 8x H100 80GB | 640 GB | Yes |
| 8x A100 80GB | 640 GB | Yes |
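As a back-of-envelope check of why these setups fit (illustrative arithmetic only; real per-GPU usage also includes KV cache, activations, and framework overhead):

```python
# Even sharding of the INT4 checkpoint across a node
weights_gb = 289          # disk size from the table above
gpus = 8
per_gpu_gb = weights_gb / gpus
print(f"~{per_gpu_gb:.1f} GB of weights per GPU")  # well within an 80 GB card
```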
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Ex0bit/Kimi-K2.5-PRISM-REAP-530B-A32B",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "Ex0bit/Kimi-K2.5-PRISM-REAP-530B-A32B",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What is the capital of France?"}]
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True, thinking=False
)
inputs = inputs.to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, temperature=0.6, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Re-pruning at Different Ratios
The calibration saliency scores are included in the calibration/ directory. You can re-prune at a higher compression ratio without re-running the expensive calibration forward pass:
```bash
# Clone this repo's REAP source
git clone https://huggingface.co/Ex0bit/Kimi-K2.5-PRISM-REAP-530B-A32B
cd Kimi-K2.5-PRISM-REAP-530B-A32B

# Re-prune at 65% (384 -> 134 experts, ~208 GB)
python3 reap/src/kimi_reap.py \
    --model moonshotai/Kimi-K2.5 \
    --load_scores calibration/reap_scores_v9_512samples.pt \
    --compression_ratio 0.65 \
    --save_model \
    --output_dir ./Kimi-K2.5-PRISM-REAP-65pct

# Re-prune at 75% (384 -> 96 experts, ~155 GB)
python3 reap/src/kimi_reap.py \
    --model moonshotai/Kimi-K2.5 \
    --load_scores calibration/reap_scores_v9_512samples.pt \
    --compression_ratio 0.75 \
    --save_model \
    --output_dir ./Kimi-K2.5-PRISM-REAP-75pct
```
## Calibration Details
| Parameter | Value |
|---|---|
| Dataset | allenai/tulu-3-sft-mixture |
| Max sequence length | 2800 tokens |
| Seed | 42 |
| Calibration time | 72.6 minutes (8x H200) |
| Pruning time | 7.3 seconds |
| Save time | 5.7 minutes |
## File Structure
```
.
├── model-00001-of-00031.safetensors   # Model shards (289 GB total)
├── ...
├── model-00031-of-00031.safetensors
├── model.safetensors.index.json
├── config.json                        # Updated: n_routed_experts=192
├── tokenizer_config.json
├── generation_config.json
├── calibration/
│   ├── reap_scores_v9_512samples.pt     # Saliency scores (reusable for re-pruning)
│   ├── reap_accumulator_checkpoint.pt   # Raw accumulators (for extending calibration)
│   └── reap_pruning_metadata.json       # Full pruning metadata per layer
└── reap/
    ├── src/
    │   ├── kimi_reap.py   # Main entry point (with all compatibility shims)
    │   ├── observer.py    # REAP saliency observer hooks
    │   └── data.py        # Calibration dataset loading
    └── scripts/
        ├── bench_int4.py      # INT4 GEMM benchmarks
        └── bench_int4_v2.py   # torch.compile benchmark
```
## Compatibility Shims
Loading Kimi-K2.5 with compressed-tensors requires several monkey-patches (all included in reap/src/kimi_reap.py):
| Shim | Purpose |
|---|---|
| Shim 0 | _initialize_weights guard — prevents _init_weights from overwriting loaded weights |
| Shim 1 | is_torch_fx_available stub — removed in transformers 5.x |
| Shim 2a | compress_model fast path — skip 69,120 meta modules (111 min to <1s) |
| Shim 2b | Quantizer ignore list — language_model. prefix fix |
| Shim 2c | register_offload_parameter safety |
| Shim 2d+2e+2g | Fused compiled INT4 forward — torch.compile decompress+matmul |
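These shims are ordinary monkey-patches. As a toy illustration of the Shim 0 pattern (a self-contained stand-in class, not the actual `transformers.PreTrainedModel` patch in `kimi_reap.py`):

```python
# Hypothetical stand-in for PreTrainedModel, to show the guard pattern only.
class Model:
    def __init__(self):
        self.weight = None
        self.loaded = False   # flag set once checkpoint weights are in place

    def _initialize_weights(self, module=None):
        self.weight = 0.0     # destructive re-initialization we want to guard

# Shim 0 pattern: wrap the method so it no-ops once weights are loaded,
# instead of clobbering them with fresh init values.
_orig_init = Model._initialize_weights

def _guarded_init(self, module=None):
    if getattr(self, "loaded", False):
        return                # weights already loaded from checkpoint: skip
    _orig_init(self, module)

Model._initialize_weights = _guarded_init
```

The real shim applies the same wrap-and-guard idea to the model class before `from_pretrained` runs, so deferred initialization cannot overwrite the quantized weights.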
## Citation
```bibtex
@article{reap2025,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Cerebras Research},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}
```
## Acknowledgments
- moonshotai/Kimi-K2.5 — base model
- Cerebras REAP — pruning method
- PRISM — over-refusal and bias removal pipeline