---
license: apache-2.0
base_model: moonshotai/Kimi-K2-Instruct-0905
tags:
- mlx
- quantized
- kimi
- deepseek-v3
- moe
- instruction-following
- 2-bit
- apple-silicon
model_type: kimi_k2
pipeline_tag: text-generation
language:
- en
- zh
library_name: mlx
---
# 🌙 Kimi K2 Instruct - MLX 2-bit
### State-of-the-Art 1T-Parameter MoE Model, Optimized for Apple Silicon
**[Original Model](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)** | **[MLX Framework](https://github.com/ml-explore/mlx)** | **[More Quantizations](#-other-quantization-options)**
---
## 📖 What is This?
This is an **ultra-compact 2-bit quantized version** of Kimi K2 Instruct, optimized to run on **Apple Silicon** (M1/M2/M3/M4) Macs using the MLX framework. This is the most aggressive quantization available - perfect for testing, rapid prototyping, or when you need maximum speed with minimal memory!
### ✨ Why You'll Love It
- 🚀 **Massive Context Window** - Handle up to 262,144 tokens (~200,000 words!)
- 🧠 **1T Parameters** - One of the most capable open models available
- ⚡ **Apple Silicon Native** - Fully optimized for M-series chips with Metal acceleration
- 💨 **Smallest Size** - Only ~320 GB, the most compact version available
- 🏃 **Fastest Inference** - Lightning-fast generation speeds
- 🌏 **Bilingual** - Fluent in both English and Chinese
- 💬 **Instruction-Tuned** - Ready for conversations, coding, analysis, and more
## 🎯 Quick Start
## Hardware Requirements
Kimi-K2 is a massive 1T-parameter MoE model. Choose your quantization based on available unified memory (a sizing sketch follows the table):
| Quantization | Model Size | Min Unified Memory | Quality |
|:------------:|:----------:|:------------------:|:--------|
| **2-bit** | ~320 GB | 384 GB | Acceptable - noticeable quality loss |
| **3-bit** | ~420 GB | 512 GB | Good - recommended minimum |
| **4-bit** | ~540 GB | 640 GB | Very Good - best quality/size balance |
| **5-bit** | ~660 GB | 768 GB | Excellent |
| **6-bit** | ~800 GB | 960 GB | Near original |
| **8-bit** | ~1 TB | 1.2 TB | Original quality |
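As a rough cross-check, the sizes above follow from total parameters × bits per weight / 8, plus overhead for quantization scales and layers kept at higher precision. A minimal sketch, where the ~1T parameter count and 15% overhead factor are assumptions rather than exact figures for this repo:

```python
# Back-of-envelope sizing: bytes ≈ total_params * bits / 8, plus overhead.
# The parameter count and overhead factor below are illustrative assumptions.

TOTAL_PARAMS = 1.0e12   # ~1T parameters (MoE total, not active)
OVERHEAD = 1.15         # quantization scales/biases + higher-precision layers

def estimated_size_gb(bits_per_weight: float, total_params: float = TOTAL_PARAMS) -> float:
    """Rough on-disk / in-memory size for a given bit width."""
    return total_params * bits_per_weight / 8 * OVERHEAD / 1e9

for bits in (2, 3, 4, 5, 6, 8):
    print(f"{bits}-bit: ~{estimated_size_gb(bits):.0f} GB")
```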
### Recommended Configurations
| Mac Model | Max RAM | Recommended Quantization |
|:----------|:-------:|:-------------------------|
| Mac Studio M3 Ultra | 512 GB | 2-bit or 3-bit |
| Mac Studio M2 Ultra | 192 GB | Below the ~320 GB needed for 2-bit |
| MacBook Pro M3 Max / M4 Max | 128 GB | Below the ~320 GB needed for 2-bit |
### Performance Notes
- **Inference Speed**: Expect roughly 5-15 tokens/sec depending on quantization and hardware (a simple timing sketch follows this list)
- **First-Token Latency**: Loading the weights and processing the prompt adds a noticeable delay before the first token, from tens of seconds to several minutes depending on SSD speed and prompt length
- **Context Window**: Full 262,144-token (256K) context supported
- **Active Parameters**: Only ~32B parameters active per token (MoE architecture)
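To sanity-check the throughput figure on your own hardware once `mlx-lm` is installed (see the next section), here is a minimal timing sketch; the prompt and the tokenizer-based token count are just illustrative:

```python
import time
from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-2bit")

prompt = "Summarize the benefits of unified memory for large language models."

start = time.perf_counter()
text = generate(model, tokenizer, prompt=prompt, max_tokens=128)
elapsed = time.perf_counter() - start

# Approximate decode speed; this ignores the prompt-processing vs. decode split.
n_tokens = len(tokenizer.encode(text))
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```

If your `mlx-lm` version supports it, passing `verbose=True` to `generate` prints prompt and generation speeds directly.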
## Installation
```bash
pip install mlx-lm
```
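Before downloading ~320 GB of weights, it is worth a quick check that MLX sees the Metal backend; a minimal sketch, assuming current `mlx` releases expose `mx.metal.is_available()`:

```python
import mlx.core as mx
import mlx_lm  # noqa: F401  # confirms the package imports cleanly

# Should print True on Apple Silicon with a working Metal backend.
print("Metal available:", mx.metal.is_available())
```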
### Your First Generation (3 lines of code!)
```python
from mlx_lm import load, generate
model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-2bit")
print(generate(model, tokenizer, prompt="Explain quantum entanglement simply:", max_tokens=200))
```
That's it! 🎉
## 💻 System Requirements
| Component | Minimum | Recommended |
|-----------|---------|-------------|
| **Mac** | Apple Silicon (Ultra-class chip) | Mac Studio M3 Ultra |
| **Memory** | ~384 GB unified (in practice a 512 GB configuration) | 512 GB unified |
| **Storage** | 350 GB free | Fast SSD (1+ TB) |
| **macOS** | 13.5+ | Latest version |
> ⚡ **Note:** This 2-bit version is the most memory-efficient of the available quantizations, but it still needs a very high-memory Mac.
## 📚 Usage Examples
### Command Line Interface
```bash
mlx_lm.generate \
--model richardyoung/Kimi-K2-Instruct-0905-MLX-2bit \
--prompt "Write a Python script to analyze CSV files." \
--max-tokens 500
```
### Chat Conversation
```python
from mlx_lm import load, generate
model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-2bit")
conversation = """<|im_start|>system
You are a helpful AI assistant specialized in coding and problem-solving.<|im_end|>
<|im_start|>user
Can you help me optimize this Python code?<|im_end|>
<|im_start|>assistant
"""
response = generate(model, tokenizer, prompt=conversation, max_tokens=500)
print(response)
```
### Advanced: Streaming Output
```python
from mlx_lm import load, stream_generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-2bit")

# stream_generate yields partial results as they are produced
for response in stream_generate(
    model,
    tokenizer,
    prompt="Tell me about the future of AI:",
    max_tokens=500,
):
    # Recent mlx-lm versions yield objects with a .text field; older versions
    # yield plain strings, in which case print(response, ...) is enough.
    print(response.text, end="", flush=True)
print()
```
## 🏗️ Architecture Highlights
### Model Specifications
| Feature | Value |
|---------|-------|
| **Total Parameters** | ~1 Trillion |
| **Architecture** | DeepSeek V3 (MoE) |
| **Experts** | 384 routed + 1 shared |
| **Active Experts** | 8 per token |
| **Hidden Size** | 7168 |
| **Layers** | 61 |
| **Heads** | 64 |
| **Context Length** | 262,144 tokens |
| **Quantization** | 2-bit (ultra-aggressive) |
### Advanced Features
- **🎯 YaRN Rope Scaling** - 64x factor for extended context
- **🗜️ KV Compression** - LoRA-based (rank 512)
- **⚡ Query Compression** - Q-LoRA (rank 1536)
- **🧮 MoE Routing** - Top-8 expert selection with sigmoid scoring (see the routing sketch below)
- **🔧 FP8 Training** - Pre-quantized with e4m3 precision
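To make the routing bullet concrete, here is a small framework-agnostic sketch of top-8 selection with sigmoid scoring over 384 routed experts; the shapes, random inputs, and gate renormalization are illustrative rather than the model's exact implementation:

```python
import numpy as np

N_EXPERTS, TOP_K = 384, 8

def route(hidden: np.ndarray, router_weights: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Pick TOP_K experts per token using sigmoid affinity scores.

    hidden:         (tokens, d_model) activations
    router_weights: (d_model, N_EXPERTS) router projection
    """
    scores = 1.0 / (1.0 + np.exp(-(hidden @ router_weights)))    # sigmoid affinities
    top_idx = np.argsort(scores, axis=-1)[:, -TOP_K:]            # indices of the top-8 experts
    top_scores = np.take_along_axis(scores, top_idx, axis=-1)
    gates = top_scores / top_scores.sum(axis=-1, keepdims=True)  # renormalize selected gates
    return top_idx, gates

# Tiny demo with random activations (hidden size 7168 as in the spec table)
rng = np.random.default_rng(0)
idx, gates = route(rng.normal(size=(4, 7168)), rng.normal(size=(7168, N_EXPERTS)) * 0.02)
print(idx.shape, gates.shape)  # (4, 8) (4, 8)
```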
## 🎨 Other Quantization Options
Choose the right balance for your needs:
| Quantization | Size | Quality | Speed | Best For |
|--------------|------|---------|-------|----------|
| [8-bit](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-8bit) | ~1 TB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | Production, best quality |
| [6-bit](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-6bit) | ~800 GB | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Sweet spot for most users |
| [5-bit](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-5bit) | ~660 GB | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Great quality/size balance |
| [4-bit](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-4bit) | ~540 GB | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Faster inference |
| [3-bit](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-3bit) | ~420 GB | ⭐⭐ | ⭐⭐⭐⭐⭐ | Very fast, compact |
| **2-bit** (you are here) | ~320 GB | ⭐⭐ | ⭐⭐⭐⭐⭐ | Fastest, most compact |
| [Original](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905) | ~1 TB (FP8) | ⭐⭐⭐⭐⭐ | ⭐⭐ | Research only |
## 🔧 How It Was Made
This model was quantized using MLX's built-in quantization:
```bash
mlx_lm.convert \
--hf-path moonshotai/Kimi-K2-Instruct-0905 \
--mlx-path Kimi-K2-Instruct-0905-MLX-2bit \
-q --q-bits 2 \
--trust-remote-code
```
**Result:** Ultra-aggressive 2-bit quantization for maximum speed and minimal storage
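The same conversion can also be scripted; a minimal sketch, assuming the `mlx_lm.convert` Python API mirrors the CLI flags above (verify against your installed version):

```python
# Assumes mlx_lm exposes convert() with CLI-equivalent keyword arguments;
# check your installed version's signature before relying on it.
from mlx_lm import convert

convert(
    hf_path="moonshotai/Kimi-K2-Instruct-0905",
    mlx_path="Kimi-K2-Instruct-0905-MLX-2bit",
    quantize=True,
    q_bits=2,
)
```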
## ⚡ Performance Tips
Tips for getting the best performance:
1. **Close other applications** - Free up as much RAM as possible
2. **Use an external SSD** - If your internal drive is full
3. **Monitor memory** - Watch Activity Monitor during inference (see the sketch after this list)
4. **Adjust batch size** - If you get OOM errors, reduce max_tokens
5. **Keep your Mac cool** - Good airflow helps maintain peak performance
6. **Perfect for testing** - Use this for rapid iteration and development
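For tips 3 and 4, a minimal sketch of checking peak memory from Python; the `mx.get_peak_memory()` helper is assumed from recent MLX releases (older releases expose it as `mx.metal.get_peak_memory()`):

```python
import mlx.core as mx
from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-2bit")

# Keep generations short while iterating; long outputs grow the KV cache.
text = generate(model, tokenizer, prompt="One-sentence summary of MoE models:", max_tokens=64)
print(text)

# Peak unified-memory use of this process, in bytes. On older MLX versions
# this may live at mx.metal.get_peak_memory() instead.
print(f"Peak memory: {mx.get_peak_memory() / 1e9:.1f} GB")
```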
## ⚠️ Known Limitations
- 🍎 **Apple Silicon Only** - Won't work on Intel Macs or NVIDIA GPUs
- 💾 **Quality Trade-off** - 2-bit quantization significantly impacts quality
- 🎯 **Best for Testing** - Not recommended for production use
- 📊 **Experimental** - May produce less accurate or coherent outputs
- 🌐 **Bilingual Focus** - Optimized for English and Chinese
- 💡 **Use Case** - Great for prototyping, testing, and resource-constrained environments
> **💡 Recommendation:** For production use, consider the 6-bit or 8-bit versions for better quality!
## 📄 License
Apache 2.0 - Same as the original model. Free for commercial use!
## 🙏 Acknowledgments
- **Original Model:** [Moonshot AI](https://www.moonshot.cn/) for creating Kimi K2
- **Framework:** Apple's [MLX team](https://github.com/ml-explore/mlx) for the amazing framework
- **Inspiration:** DeepSeek V3 architecture
## 📚 Citation
If you use this model in your research or product, please cite:
```bibtex
@misc{kimi-k2-2025,
  title={Kimi K2: Advancing Long-Context Language Models},
  author={Moonshot AI},
  year={2025},
  url={https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905}
}
```
## 🔗 Useful Links
- 📦 **Original Model:** [moonshotai/Kimi-K2-Instruct-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)
- 🛠️ **MLX Framework:** [GitHub](https://github.com/ml-explore/mlx)
- 📖 **MLX LM Docs:** [GitHub](https://github.com/ml-explore/mlx-examples/tree/main/llms)
- 💬 **Discussions:** [Ask questions here!](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-2bit/discussions)
---
**Quantized with ❤️ by [richardyoung](https://deepneuro.ai/richard)**
*If you find this useful, please ⭐ star the repo and share with others!*
**Created:** October 2025 | **Format:** MLX 2-bit