llama.cpp Mixed Precision Quant of DeepSeek-V3-0324
All quants are built on moxin-org/CC-MoE.
IQ1_M is based on recipes defined via llama.cpp's --tensor-type option (see the sketch after the size list below).
IQ1_S is a more dynamic variant intended for extreme compression.
Q2_K_L is a specialized version that uses only 2/4/8-bit quantization, designed for personalized deployment and experiments.
- IQ1_S : 137.66 GiB (1.76 BPW)
- IQ1_M : 151.25 GiB (1.94 BPW)
- Q2_K_L : 210.60 GiB (2.70 BPW)
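For reference, a mixed-precision recipe of the kind mentioned above can be expressed through llama-quantize's per-tensor overrides. The sketch below is illustrative only: the tensor patterns, target types, and file names are placeholders rather than the actual recipe used to produce these files, and it assumes a recent llama.cpp build whose llama-quantize supports `--tensor-type`.

```bash
# Illustrative sketch, not the recipe used for this repo: override selected
# tensor groups while quantizing the remaining tensors to IQ1_M.
build/bin/llama-quantize \
  --tensor-type attn_v=q4_k \
  --tensor-type ffn_down=iq2_xs \
  DeepSeek-V3-0324-BF16.gguf \
  DeepSeek-V3-0324-Moxin-IQ1_M.gguf \
  IQ1_M
```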
👈 Download Guide
# !pip install huggingface_hub hf_transfer
import os
# os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id = "moxin-org/DeepSeek-V3-0324-Moxin-GGUF",
    local_dir = "DeepSeek-V3-0324-Moxin-GGUF",
    allow_patterns = ["*IQ1_M*"],  # Q2_K_L, IQ1_S, Mini
)
Downloads are available via huggingface_hub, huggingface-cli, snapshot_download, and xet.
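For example, the same files can be fetched with the CLI; adjust `--include` to `*Q2_K_L*`, `*IQ1_S*`, or `*Mini*` as needed.

```bash
# CLI equivalent of the snapshot_download snippet above.
huggingface-cli download moxin-org/DeepSeek-V3-0324-Moxin-GGUF \
  --include "*IQ1_M*" \
  --local-dir DeepSeek-V3-0324-Moxin-GGUF
```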
Benchmark Comparison
| Benchmark (Metric) | llama.cpp IQ1_M (140G) | llama.cpp Q2_K (230G) | Ours IQ1_S (138G) | Ours IQ1_M (152G) |
|---|---|---|---|---|
| Winogrande | 73.00 | 77.74 | 78.69 | **79.48** |
| MMLU (EM) | 75.45 | 85.57 | 85.42 | **86.07** |
| CMMLU | 77.06 | 82.57 | 86.65 | **87.84** |
| Hellaswag | 78.70 | **86.46** | 85.39 | 85.94 |
| gsm8k | 83.40 | 93.40 | 93.93 | **94.39** |
| BBH | 24.68 | 69.19 | 84.95 | **86.87** |
Note: All models use the MoE architecture with 37B activated and 671B total parameters.
Bold values mark the best performance per benchmark.
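The scores above come from full benchmark runs. As a lightweight local sanity check of a downloaded quant, llama.cpp's bundled llama-perplexity tool can be used instead; this is not the benchmark suite used for the table, and the model and text-file paths below are placeholders.

```bash
# Rough perplexity check of a downloaded quant; paths are placeholders.
build/bin/llama-perplexity \
  -m DeepSeek-V3-0324-Moxin-GGUF/V3-IQ1_M/DeepSeek-V3-0324-Moxin-IQ1_M-00001-of-00006.gguf \
  -f wiki.test.raw \
  -ngl 99 -c 2048
```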
Usage
Example of running the GGUF with a local build of llama.cpp (llama-cli / llama-server).
👈 Build llama.cpp locally
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF -DLLAMA_CURL=OFF
cmake --build build --config Release -j --clean-first
build/bin/llama-cli -m DeepSeek-V3-0324-Moxin-GGUF/V3-IQ1_M/DeepSeek-V3-0324-Moxin-IQ1_M-00001-of-00006.gguf \
    -ngl 99 \
    --temp 0.3 \
    --min-p 0.01 \
    --ctx-size 8192   # or 4096, 16384
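A llama-server invocation follows the same pattern and exposes the model over llama.cpp's built-in HTTP server; the host and port below are illustrative defaults, and the model path matches the llama-cli example above.

```bash
# Serve the same quant with llama-server; host/port are illustrative.
build/bin/llama-server \
  -m DeepSeek-V3-0324-Moxin-GGUF/V3-IQ1_M/DeepSeek-V3-0324-Moxin-IQ1_M-00001-of-00006.gguf \
  -ngl 99 \
  --ctx-size 8192 \
  --host 0.0.0.0 --port 8080
```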
Smallest Compression (CC-MoE)
For our smallest compressed version, 105.58 GiB (1.79 BPW), please refer to tflsxyy/DeepSeek-V3-0324-E192 and V3-Mini-Exp for more details.
Citation
If this work is helpful, please kindly cite as:
@article{chen2025collaborative,
  title={Collaborative Compression for Large-Scale MoE Deployment on Edge},
  author={Chen, Yixiao and Xie, Yanyue and Yang, Ruining and Jiang, Wei and Wang, Wei and He, Yong and Chen, Yue and Zhao, Pu and Wang, Yanzhi},
  journal={arXiv preprint arXiv:2509.25689},
  year={2025}
}
Acknowledgements
This repository builds upon the outstanding work of the following open-source authors and projects:
- DeepSeek-V3.
- tflsxyy.
- ggml-org/llama.cpp, unsloth.ai, bartowski.
- ikawrakow/ik_llama.cpp, ikawrakow, ubergarm.
- EleutherAI/lm-evaluation-harness.
We sincerely thank them for their excellent contributions to the open-source community.