
# Kimi-K2.5-PRISM

An unrestricted/unchained PRISM version of Moonshot AI's Kimi-K2.5 with over-refusal and propaganda mechanisms removed using our advanced PRISM pipeline (Projected Refusal Isolation via Subspace Modification).

☕ Support Our Work

If you enjoy our work and find it useful, please consider sponsoring or supporting us!

Ko-fi

| Option | Description |
|---|---|
| PRISM VIP Membership | Access to all PRISM models |
| One-Time Support | Support this model |

Model Highlights

  • PRISM Ablation: State-of-the-art technique that removes over-refusal behaviors while preserving model capabilities
  • 1T MoE Architecture: 1 trillion total parameters with 32 billion active per token across 384 experts
  • Native Multimodal: Pre-trained on vision-language tokens for seamless image, video, and text understanding
  • 256K Context Window: Extended context for complex agentic tasks and large codebases
  • Dual Modes: Supports both Thinking (deep reasoning) and Instant (fast response) modes
  • Agent Swarm: Self-directed, coordinated multi-agent execution for complex tasks

Model Architecture

| Specification | Value |
|---|---|
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 1T |
| Activated Parameters | 32B |
| Number of Layers | 61 |
| Attention Hidden Dimension | 7168 |
| Number of Attention Heads | 64 |
| Number of Experts | 384 |
| Selected Experts per Token | 8 |
| Shared Experts | 1 |
| Vocabulary Size | 160K |
| Context Length | 256K |
| Attention Mechanism | MLA |
| Activation Function | SwiGLU |
| Vision Encoder | MoonViT (400M) |
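
The sparsity implied by these figures can be checked with a few lines of arithmetic. This is only an illustrative sketch: it treats "1T" and "32B" as exact round numbers and assumes the shared expert is counted separately from the 384 routed experts. Neither assumption is stated in the table, and the variable names are ours.

```python
# Figures copied from the architecture table. "1T" and "32B" are treated
# as exact round numbers, which is an assumption.
TOTAL_PARAMS = 1_000_000_000_000
ACTIVE_PARAMS = 32_000_000_000
NUM_EXPERTS = 384
SELECTED_EXPERTS = 8
SHARED_EXPERTS = 1

# Share of all weights that participate in a single token's forward pass.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Experts evaluated per token in one MoE layer: routed plus shared
# (assumes the shared expert is not one of the 384 routed experts).
experts_per_token = SELECTED_EXPERTS + SHARED_EXPERTS

# Share of the routed expert pool that the router picks for a token.
routed_fraction = SELECTED_EXPERTS / NUM_EXPERTS

print(f"Active parameters per token: {active_fraction:.1%}")
print(f"Experts evaluated per token: {experts_per_token}")
print(f"Routed experts selected: {routed_fraction:.2%}")
```

Under these assumptions roughly 3.2% of the weights are active for any given token, which is why per-token compute is far lower than the total parameter count suggests.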

Benchmarks

| Benchmark | Kimi K2.5 (Thinking) | GPT-5.2 | Claude 4.5 Opus | Gemini 3 Pro |
|---|---|---|---|---|
| AIME 2025 | 96.1 | 100 | 92.8 | 95.0 |
| GPQA-Diamond | 87.6 | 92.4 | 87.0 | 91.9 |
| HLE-Full | 30.1 | 34.5 | 30.8 | 37.5 |
| HLE-Full (w/ tools) | 50.2 | 45.5 | 43.2 | 45.8 |
| SWE-Bench Verified | 76.8 | 80.0 | 80.9 | 76.2 |
| Terminal Bench 2.0 | 50.8 | 54.0 | 59.3 | 54.2 |
| BrowseComp | 60.6 | 65.8 | 37.0 | 37.8 |
| MMMU-Pro | 78.5 | 79.5 | 74.0 | 81.0 |
| VideoMMMU | 86.6 | 85.9 | 84.4 | 87.6 |

Usage

Transformers

Install dependencies:

pip install git+https://github.com/huggingface/transformers.git

Basic chat completion:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "Ex0bit/Kimi-K2.5-PRISM"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "You are Kimi, an AI assistant."},
    {"role": "user", "content": "Hello!"}
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=1.0, top_p=0.95)
output_text = tokenizer.decode(generated_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(output_text)

Chat with Image

import base64
import requests

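# Load the image and base64-encode it (fail early on HTTP errors)
url = "https://example.com/image.png"
response = requests.get(url, timeout=30)
response.raise_for_status()
image_base64 = base64.b64encode(response.content).decode()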

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in detail."},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{image_base64}"},
            },
        ],
    }
]

# Use same generation code as above
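
The example above hard-codes `image/png` in the data URL. If your images come in several formats, a small helper can pick the MIME type from the file name. The sketch below uses only the standard library; `to_data_url` and `image_message` are hypothetical helper names, not part of any library, and the PNG fallback is our choice.

```python
import base64
import mimetypes


def to_data_url(image_bytes: bytes, filename: str) -> str:
    """Encode raw image bytes as a base64 data URL.

    Hypothetical helper. The MIME type is guessed from the file
    extension and falls back to image/png when it cannot be determined.
    """
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        mime_type = "image/png"
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_message(prompt: str, data_url: str) -> dict:
    """Build a user message in the same shape as the example above."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }


# Usage with placeholder bytes (not a real image).
url = to_data_url(b"\x89PNG placeholder", "photo.jpg")
message = image_message("Describe this image in detail.", url)
print(url.split(",")[0])
```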

vLLM

Install vLLM nightly:

pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
pip install git+https://github.com/huggingface/transformers.git

Serve the model:

vllm serve Ex0bit/Kimi-K2.5-PRISM \
     --tensor-parallel-size 8 \
     --trust-remote-code \
     --served-model-name kimi-k2.5-prism
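
Once the server is up, it can be queried over its OpenAI-compatible HTTP API. The following is a minimal standard-library sketch, not an official client. It assumes the server is reachable on `localhost` at vLLM's default port 8000 and that the model name matches `--served-model-name` above; `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

# Assumptions: the server runs locally on vLLM's default port 8000 and
# exposes the OpenAI-compatible /v1/chat/completions route.
BASE_URL = "http://localhost:8000/v1"
MODEL_NAME = "kimi-k2.5-prism"  # matches --served-model-name


def build_chat_request(messages, max_tokens=4096, temperature=1.0, top_p=0.95):
    """Return a urllib Request for the chat completions endpoint."""
    payload = {
        "model": MODEL_NAME,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(messages, timeout=600):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(messages)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat([{"role": "user", "content": "Hello!"}])` should return the reply as a string. The generous timeout is deliberate, since long reasoning outputs can take a while to complete.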

SGLang

python3 -m sglang.launch_server \
  --model-path Ex0bit/Kimi-K2.5-PRISM \
  --tp-size 8 \
  --trust-remote-code \
  --served-model-name kimi-k2.5-prism \
  --host 0.0.0.0 \
  --port 8000

Recommended Parameters

| Mode | Temperature | Top-P | Max New Tokens |
|---|---|---|---|
| Thinking | 1.0 | 0.95 | 96000 |
| Instant | 0.6 | 0.95 | 4096 |
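
To avoid retyping these values, they can be kept in a small lookup and unpacked into `model.generate`. This is a convenience sketch only: `RECOMMENDED` and `generation_kwargs` are illustrative names, and the presets cover sampling settings alone, not the mode switch itself.

```python
# Values copied from the table above; the names below are illustrative.
RECOMMENDED = {
    "thinking": {"temperature": 1.0, "top_p": 0.95, "max_new_tokens": 96000},
    "instant": {"temperature": 0.6, "top_p": 0.95, "max_new_tokens": 4096},
}


def generation_kwargs(mode: str) -> dict:
    """Return keyword arguments for `model.generate` for the given mode."""
    try:
        preset = RECOMMENDED[mode.lower()]
    except KeyError:
        known = sorted(RECOMMENDED)
        raise ValueError(f"Unknown mode {mode!r}; expected one of {known}") from None
    # do_sample=True is needed for temperature and top_p to take effect.
    return {"do_sample": True, **preset}


print(generation_kwargs("Instant"))
```

With the Transformers example, this would be used as `model.generate(**inputs, **generation_kwargs("thinking"))`.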

Switching Modes

For Instant mode (faster, no reasoning), pass:

# Official API
extra_body={"thinking": {"type": "disabled"}}

# vLLM/SGLang
extra_body={"chat_template_kwargs": {"thinking": False}}
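
The two switches differ only in which field carries the flag. The sketch below builds the raw request body for each case. `chat_payload` and the `backend` labels are hypothetical names. It relies on the OpenAI Python client merging `extra_body` into the top level of the request JSON, and it assumes that Thinking mode is what you get when no flag is sent.

```python
import json


def chat_payload(messages, model, thinking=True, backend="vllm"):
    """Build a chat completions request body with the mode switch applied.

    Hypothetical helper. The switch fields are the ones shown above,
    placed at the top level of the body, which is where the OpenAI
    Python client puts `extra_body` entries.
    """
    payload = {"model": model, "messages": messages}
    if thinking:
        # Assumption: Thinking mode is the default, so nothing is added.
        return payload
    if backend == "official":
        payload["thinking"] = {"type": "disabled"}
    elif backend in ("vllm", "sglang"):
        payload["chat_template_kwargs"] = {"thinking": False}
    else:
        raise ValueError(f"Unknown backend: {backend!r}")
    return payload


messages = [{"role": "user", "content": "Hello!"}]
instant = chat_payload(messages, "kimi-k2.5-prism", thinking=False)
print(json.dumps(instant, indent=2))
```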

Hardware Requirements

Due to the 1T parameter size, this model requires significant hardware:

  • Minimum: 8x A100 80GB or equivalent
  • Recommended: 8x H100 80GB for optimal performance
  • INT4 Quantization: Available for reduced memory footprint
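
For a rough sense of scale, the weight footprint can be estimated from the parameter count alone. This is a back-of-the-envelope sketch, not a measurement. It assumes exactly 1e12 parameters and a single uniform precision for every weight, uses decimal gigabytes, and ignores the KV cache, activations, and framework overhead.

```python
# Weights-only estimate. Assumptions: exactly 1e12 parameters, one uniform
# precision for every weight, decimal gigabytes, and no allowance for the
# KV cache, activations, or framework overhead.
TOTAL_PARAMS = 1_000_000_000_000
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(precision: str) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return TOTAL_PARAMS * BYTES_PER_PARAM[precision] / 1e9


def fits(precision: str, num_gpus: int = 8, gb_per_gpu: float = 80.0) -> bool:
    """True if the weights alone fit in the combined GPU memory."""
    return weight_memory_gb(precision) <= num_gpus * gb_per_gpu


for precision in BYTES_PER_PARAM:
    size = weight_memory_gb(precision)
    print(f"{precision}: ~{size:,.0f} GB, fits on 8x 80GB: {fits(precision)}")
```

Under these assumptions, only the INT4 estimate (about 500 GB) fits within the 640 GB of combined memory on eight 80GB GPUs. That suggests the configurations listed above apply to quantized weights, but treat this as our inference from the arithmetic rather than a statement from the model authors.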

License

This model is released under the PRISM Research License.

Acknowledgments

Based on Kimi-K2.5 by Moonshot AI. See the technical blog for more details on the base model.
