An unrestricted/unchained PRISM version of Moonshot AI's Kimi-K2.5 with over-refusal and propaganda mechanisms removed using our advanced PRISM pipeline (Projected Refusal Isolation via Subspace Modification).
Support Our Work
If you enjoy our work and find it useful, please consider sponsoring or supporting us!
| Option | Description |
|---|---|
| PRISM VIP Membership | Access to all PRISM models |
| One-Time Support | Support this model |
Model Highlights
- PRISM Ablation: State-of-the-art technique that removes over-refusal behaviors while preserving model capabilities
- 1T MoE Architecture: 1 trillion total parameters with 32 billion active per token across 384 experts
- Native Multimodal: Pre-trained on vision-language tokens for seamless image, video, and text understanding
- 256K Context Window: Extended context for complex agentic tasks and large codebases
- Dual Modes: Supports both Thinking (deep reasoning) and Instant (fast response) modes
- Agent Swarm: Self-directed, coordinated multi-agent execution for complex tasks
Model Architecture
| Specification | Value |
|---|---|
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 1T |
| Activated Parameters | 32B |
| Number of Layers | 61 |
| Attention Hidden Dimension | 7168 |
| Number of Attention Heads | 64 |
| Number of Experts | 384 |
| Selected Experts per Token | 8 |
| Shared Experts | 1 |
| Vocabulary Size | 160K |
| Context Length | 256K |
| Attention Mechanism | MLA |
| Activation Function | SwiGLU |
| Vision Encoder | MoonViT (400M) |
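To make the expert-routing numbers above concrete: each token is scored against all 384 routed experts, the 8 highest-scoring experts are activated, and the single shared expert always contributes. The toy sketch below illustrates only this top-k routing pattern; the class name, shrunken hidden size, and single-linear "experts" are illustrative assumptions, not the actual Kimi-K2.5 implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy dimensions: real hidden size is 7168, experts are full FFNs.
HIDDEN, NUM_EXPERTS, TOP_K = 64, 384, 8

class ToyMoELayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.router = nn.Linear(HIDDEN, NUM_EXPERTS, bias=False)
        self.experts = nn.ModuleList(nn.Linear(HIDDEN, HIDDEN) for _ in range(NUM_EXPERTS))
        self.shared_expert = nn.Linear(HIDDEN, HIDDEN)  # the 1 always-active expert

    def forward(self, x):  # x: (num_tokens, HIDDEN)
        scores = self.router(x)                    # (num_tokens, NUM_EXPERTS)
        weights, idx = scores.topk(TOP_K, dim=-1)  # keep the 8 best experts per token
        weights = F.softmax(weights, dim=-1)       # normalize router weights
        outs = []
        for t in range(x.size(0)):                 # per-token loop for clarity, not speed
            y = self.shared_expert(x[t])
            for w, e in zip(weights[t], idx[t]):
                y = y + w * self.experts[int(e)](x[t])
            outs.append(y)
        return torch.stack(outs)

layer = ToyMoELayer()
print(layer(torch.randn(4, HIDDEN)).shape)  # torch.Size([4, 64])
```

Scoring all experts but running only 8 of them (plus the shared one) is what keeps the active parameter count at 32B despite the 1T total.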
Benchmarks
| Benchmark | Kimi K2.5 (Thinking) | GPT-5.2 | Claude 4.5 Opus | Gemini 3 Pro |
|---|---|---|---|---|
| AIME 2025 | 96.1 | 100 | 92.8 | 95.0 |
| GPQA-Diamond | 87.6 | 92.4 | 87.0 | 91.9 |
| HLE-Full | 30.1 | 34.5 | 30.8 | 37.5 |
| HLE-Full (w/ tools) | 50.2 | 45.5 | 43.2 | 45.8 |
| SWE-Bench Verified | 76.8 | 80.0 | 80.9 | 76.2 |
| Terminal Bench 2.0 | 50.8 | 54.0 | 59.3 | 54.2 |
| BrowseComp | 60.6 | 65.8 | 37.0 | 37.8 |
| MMMU-Pro | 78.5 | 79.5 | 74.0 | 81.0 |
| VideoMMMU | 86.6 | 85.9 | 84.4 | 87.6 |
Usage
Transformers
Install dependencies:
```bash
pip install git+https://github.com/huggingface/transformers.git
```
Basic chat completion:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "Ex0bit/Kimi-K2.5-PRISM"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "You are Kimi, an AI assistant."},
    {"role": "user", "content": "Hello!"},
]

# Build model inputs from the chat template.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Thinking-mode sampling defaults (see Recommended Parameters below).
generated_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=1.0, top_p=0.95)

# Decode only the newly generated tokens.
output_text = tokenizer.decode(generated_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(output_text)
```
Chat with Image
```python
import base64
import requests

# Fetch an image and encode it as a base64 data URL.
url = "https://example.com/image.png"
image_base64 = base64.b64encode(requests.get(url).content).decode()

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in detail."},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{image_base64}"},
            },
        ],
    }
]

# Use the same generation code as above.
```
vLLM
Install vLLM nightly:
```bash
pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
pip install git+https://github.com/huggingface/transformers.git
```
Serve the model:
```bash
vllm serve Ex0bit/Kimi-K2.5-PRISM \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --served-model-name kimi-k2.5-prism
```
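Once the server is running, it exposes an OpenAI-compatible endpoint. A quick smoke test with the openai Python client (vLLM's default port 8000 and a dummy API key are assumed here):

```python
from openai import OpenAI

# Minimal smoke test against the server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="kimi-k2.5-prism",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```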
SGLang
```bash
python3 -m sglang.launch_server \
  --model-path Ex0bit/Kimi-K2.5-PRISM \
  --tp-size 8 \
  --trust-remote-code \
  --served-model-name kimi-k2.5-prism \
  --host 0.0.0.0 \
  --port 8000
```
Recommended Parameters
| Mode | Temperature | Top-P | Max New Tokens |
|---|---|---|---|
| Thinking | 1.0 | 0.95 | 96000 |
| Instant | 0.6 | 0.95 | 4096 |
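As an illustration, the Instant-mode row maps onto the earlier Transformers example's generate call like this (disabling the reasoning pass itself is covered in Switching Modes below):

```python
# Instant-mode sampling settings from the table above, applied to the
# earlier Transformers example. Thinking mode would instead use
# temperature=1.0, top_p=0.95, and a much larger token budget (96000).
generated_ids = model.generate(
    **inputs,
    max_new_tokens=4096,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
)
```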
Switching Modes
For Instant mode (faster, no reasoning), pass:
```python
# Official API
extra_body={"thinking": {"type": "disabled"}}

# vLLM/SGLang
extra_body={"chat_template_kwargs": {"thinking": False}}
```
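Putting it together, here is a sketch of an Instant-mode request against the vLLM/SGLang server from the Usage section (the localhost endpoint, served model name, and dummy API key are assumptions based on the commands above):

```python
from openai import OpenAI

# Assumes the vLLM/SGLang server from above is listening on localhost:8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="kimi-k2.5-prism",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.6,   # Instant-mode settings from Recommended Parameters
    top_p=0.95,
    max_tokens=4096,
    extra_body={"chat_template_kwargs": {"thinking": False}},  # disable reasoning
)
print(response.choices[0].message.content)
```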
Hardware Requirements
At 1T total parameters, this model requires significant hardware:
- Minimum: 8x A100 80GB or equivalent
- Recommended: 8x H100 80GB for optimal performance
- INT4 Quantization: Available for reduced memory footprint (a loading sketch follows below)
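If you quantize on the fly rather than using dedicated INT4 weights, a generic Transformers/bitsandbytes sketch looks like the following. This is an assumption for illustration, not an official recipe for this model; a purpose-built INT4 checkpoint may behave differently and fit hardware better.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Assumption: generic bitsandbytes 4-bit loading, not an official INT4
# recipe for this model; a dedicated INT4 checkpoint may differ.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Ex0bit/Kimi-K2.5-PRISM",
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,
)
```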
License
This model is released under the PRISM Research License.
Acknowledgments
Based on Kimi-K2.5 by Moonshot AI. See the technical blog for more details on the base model.