# Kimi-K2.5-PRISM-REAP-530B-A32B
50% REAP expert-pruned version of moonshotai/Kimi-K2.5, built from the PRISM variant.
| Property | Value |
|---|---|
| Architecture | KimiK25 (DeepSeekV3 backbone) |
| Total Parameters | ~530B (down from ~1T) |
| Active Parameters | ~32B (8 experts per token) |
| Experts per MoE Layer | 192 routed + 1 shared (down from 384 + 1) |
| MoE Layers | 60 (layers 1-60, layer 0 is dense) |
| Quantization | INT4 (group_size=32, symmetric) via compressed-tensors |
| Disk Size | 289 GB (down from 555 GB) |
| Pruning Method | REAP (Router-weighted Expert Activation Pruning) |
| Calibration | 512 samples from allenai/tulu-3-sft-mixture, max 2800 tokens |
## What is REAP?
REAP (Cerebras Research, 2025) is a one-shot expert pruning method for Mixture-of-Experts models. It computes saliency scores using the router-weighted expert output norms from real forward passes:
$$
S_j = \frac{1}{|X_j|} \sum_{x \in X_j} g_j(x)\,\lVert f_j(x) \rVert_2
$$
Where g_j(x) is the normalized gate weight and ||f_j(x)||_2 is the L2 norm of expert j's output for token x. Experts with the lowest saliency are pruned.
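The score above can be sketched in a few lines of PyTorch. This is an illustrative reimplementation, not the repo's `observer.py`; the tensor shapes are assumptions (dense per-expert layout, with zero gate weight marking tokens outside `X_j`):

```python
import torch

def reap_saliency(gate_weights: torch.Tensor, expert_outputs: torch.Tensor) -> torch.Tensor:
    """S_j = (1/|X_j|) * sum_{x in X_j} g_j(x) * ||f_j(x)||_2.

    gate_weights:   (tokens, experts) normalized router weights; zero when a
                    token is not routed to that expert.
    expert_outputs: (tokens, experts, hidden) expert outputs f_j(x).
    Returns a (experts,) saliency vector S_j.
    """
    norms = expert_outputs.norm(dim=-1)        # ||f_j(x)||_2 per (token, expert)
    routed = gate_weights > 0                  # membership indicator for X_j
    counts = routed.sum(dim=0).clamp(min=1)    # |X_j|, guarded against empty sets
    return (gate_weights * norms).sum(dim=0) / counts
```

Experts with the lowest `S_j` in each MoE layer are the pruning candidates.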
## What is PRISM?
This model was first processed with our PRISM-LITE pipeline, which softens over-refusal and bias behaviors while preserving model quality. REAP pruning was then applied on top of the PRISM model.
## Key Technical Details
- Uniform 50% pruning: Every MoE layer pruned from 384 to 192 experts
- Super expert preservation: Top 0.5th percentile experts (by activation norm) were guaranteed to survive
- Zero-redundancy observer: Saliency computed from real forward pass hooks
- torch.compile fused INT4 GEMM: Custom compiled kernel for fast INT4 decompression during calibration
- Correct saliency ordering verified: in every layer, `min_retained_saliency > max_pruned_saliency`
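The selection rule with super-expert preservation can be sketched as follows. This is a hypothetical helper for illustration, not code from `reap/src/kimi_reap.py`; it assumes saliency and activation-norm vectors are available per layer:

```python
import torch

def select_retained(saliency: torch.Tensor, act_norms: torch.Tensor, keep: int) -> list[int]:
    """Keep the `keep` highest-saliency experts in a layer, but guarantee that
    'super experts' (top 0.5th percentile by activation norm) always survive."""
    # Super experts identified by activation norm, per the rule above
    threshold = torch.quantile(act_norms, 0.995)
    super_ids = set(torch.nonzero(act_norms >= threshold).flatten().tolist())
    # Rank all experts by saliency, descending
    order = torch.argsort(saliency, descending=True).tolist()
    # Supers first, then the best of the rest until `keep` slots are filled
    retained = [i for i in order if i in super_ids]
    retained += [i for i in order if i not in super_ids][: keep - len(retained)]
    return sorted(retained)
```

When the super experts also rank highest by saliency (as verified here), this reduces to a plain top-`keep` cut, which is why the `min_retained_saliency > max_pruned_saliency` check holds in every layer.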
## Hardware Requirements
This model is 289 GB in INT4 format. You need one of the following setups (or comparable aggregate VRAM):
| Setup | VRAM | Fits? |
|---|---|---|
| 8x H200 141GB | 1,128 GB | Yes (used for calibration) |
| 8x H100 80GB | 640 GB | Yes |
| 8x A100 80GB | 640 GB | Yes |
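As a back-of-envelope check of why these setups fit (illustrative arithmetic only; real per-GPU usage also includes KV cache, activations, and framework overhead):

```python
# Even sharding of the INT4 checkpoint across a node
weights_gb = 289          # disk size from the table above
gpus = 8
per_gpu_gb = weights_gb / gpus
print(f"~{per_gpu_gb:.1f} GB of weights per GPU")  # well within an 80 GB card
```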
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Ex0bit/Kimi-K2.5-PRISM-REAP-530B-A32B",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "Ex0bit/Kimi-K2.5-PRISM-REAP-530B-A32B",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What is the capital of France?"}]
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True, thinking=False
)
inputs = inputs.to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, temperature=0.6, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Re-pruning at Different Ratios
The calibration saliency scores are included in the calibration/ directory. You can re-prune at a higher compression ratio without re-running the expensive calibration forward pass:
```bash
# Clone this repo's REAP source
git clone https://huggingface.co/Ex0bit/Kimi-K2.5-PRISM-REAP-530B-A32B
cd Kimi-K2.5-PRISM-REAP-530B-A32B

# Re-prune at 65% (384 -> 134 experts, ~208 GB)
python3 reap/src/kimi_reap.py \
    --model moonshotai/Kimi-K2.5 \
    --load_scores calibration/reap_scores_v9_512samples.pt \
    --compression_ratio 0.65 \
    --save_model \
    --output_dir ./Kimi-K2.5-PRISM-REAP-65pct

# Re-prune at 75% (384 -> 96 experts, ~155 GB)
python3 reap/src/kimi_reap.py \
    --model moonshotai/Kimi-K2.5 \
    --load_scores calibration/reap_scores_v9_512samples.pt \
    --compression_ratio 0.75 \
    --save_model \
    --output_dir ./Kimi-K2.5-PRISM-REAP-75pct
```
## Calibration Details
| Parameter | Value |
|---|---|
| Dataset | allenai/tulu-3-sft-mixture |
| Max sequence length | 2800 tokens |
| Seed | 42 |
| Calibration time | 72.6 minutes (8x H200) |
| Pruning time | 7.3 seconds |
| Save time | 5.7 minutes |
## File Structure
```
.
├── model-00001-of-00031.safetensors   # Model shards (289 GB total)
├── ...
├── model-00031-of-00031.safetensors
├── model.safetensors.index.json
├── config.json                        # Updated: n_routed_experts=192
├── tokenizer_config.json
├── generation_config.json
├── calibration/
│   ├── reap_scores_v9_512samples.pt     # Saliency scores (reusable for re-pruning)
│   ├── reap_accumulator_checkpoint.pt   # Raw accumulators (for extending calibration)
│   └── reap_pruning_metadata.json       # Full pruning metadata per layer
└── reap/
    ├── src/
    │   ├── kimi_reap.py   # Main entry point (with all compatibility shims)
    │   ├── observer.py    # REAP saliency observer hooks
    │   └── data.py        # Calibration dataset loading
    └── scripts/
        ├── bench_int4.py      # INT4 GEMM benchmarks
        └── bench_int4_v2.py   # torch.compile benchmark
```
## Compatibility Shims
Loading Kimi-K2.5 with compressed-tensors requires several monkey-patches (all included in reap/src/kimi_reap.py):
| Shim | Purpose |
|---|---|
| Shim 0 | _initialize_weights guard — prevents _init_weights from overwriting loaded weights |
| Shim 1 | is_torch_fx_available stub — removed in transformers 5.x |
| Shim 2a | compress_model fast path — skip 69,120 meta modules (111 min to <1s) |
| Shim 2b | Quantizer ignore list — language_model. prefix fix |
| Shim 2c | register_offload_parameter safety |
| Shim 2d+2e+2g | Fused compiled INT4 forward — torch.compile decompress+matmul |
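These shims are ordinary monkey-patches. As a toy illustration of the Shim 0 pattern (a self-contained stand-in class, not the actual `transformers.PreTrainedModel` patch in `kimi_reap.py`):

```python
# Hypothetical stand-in for PreTrainedModel, to show the guard pattern only.
class Model:
    def __init__(self):
        self.weight = None
        self.loaded = False   # flag set once checkpoint weights are in place

    def _initialize_weights(self, module=None):
        self.weight = 0.0     # destructive re-initialization we want to guard

# Shim 0 pattern: wrap the method so it no-ops once weights are loaded,
# instead of clobbering them with fresh init values.
_orig_init = Model._initialize_weights

def _guarded_init(self, module=None):
    if getattr(self, "loaded", False):
        return                # weights already loaded from checkpoint: skip
    _orig_init(self, module)

Model._initialize_weights = _guarded_init
```

The real shim applies the same wrap-and-guard idea to the model class before `from_pretrained` runs, so deferred initialization cannot overwrite the quantized weights.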
## Citation
```bibtex
@article{reap2025,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Cerebras Research},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}
```
## Acknowledgments
- moonshotai/Kimi-K2.5 — base model
- Cerebras REAP — pruning method
- PRISM — over-refusal and bias removal pipeline