## ~~imatrix~~ Quantization of moonshotai/Kimi-K2-Instruct-0905

Converted with mainline llama.cpp PR#17069 and quantized with ik_llama.cpp. In limited testing, the one available quant runs inference on both forks.
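
A minimal sketch of that conversion step, assuming the PR#17069 branch is checked out and the original safetensors sit in `Kimi-K2-Instruct-0905/` (file names and the bf16 intermediate are illustrative, not the exact commands used):

```bash
# Convert the original safetensors to a bf16 GGUF with mainline llama.cpp;
# the PR#17069 branch is needed for this architecture.
python convert_hf_to_gguf.py \
    --outtype bf16 \
    --outfile Kimi-K2-Instruct-0905-BF16.gguf \
    Kimi-K2-Instruct-0905/
```
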
## Big Thanks

Shout out to Wendell and the **Level1Techs** crew, the community [Forums](https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826), and the [YouTube Channel](https://www.youtube.com/@Level1Techs)! **BIG thanks** for providing **BIG hardware** expertise and access to run these experiments and make these great quants available to the community!!!

This is an interesting one: given the original model design, only a single quant is currently available. The `Q8_0-Q4_0` uses `q4_0` for the routed experts and `q8_0` for all other tensors. It works on both ik_llama.cpp and mainline llama.cpp in limited testing. It does *not* use an imatrix!
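
A hedged sketch of how a recipe like this can be expressed with `llama-quantize` via ik_llama.cpp's `--custom-q` per-tensor overrides; the regex and file names are illustrative, not necessarily the exact invocation used:

```bash
# Route the experts to q4_0; the trailing Q8_0 is the default type
# applied to every tensor the regex does not match. No --imatrix is
# passed, matching the note above.
./build/bin/llama-quantize \
    --custom-q "ffn_(up|down|gate)_exps=q4_0" \
    Kimi-K2-Instruct-0905-BF16.gguf \
    Kimi-K2-Instruct-0905-Q8_0-Q4_0.gguf \
    Q8_0
```
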
Compare with the baseline perplexity of the full-size `Q8_0-Q4_0` at 543.617 GiB (4.549 BPW).
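
The `Final estimate` line below is what the standard perplexity run over `wiki.test.raw` prints; a sketch with illustrative file names:

```bash
# Measure perplexity over wiki.test.raw in the default 512-token chunks;
# the run ends with a "Final estimate: PPL = ..." line.
./build/bin/llama-perplexity \
    -m Kimi-K2-Instruct-0905-Q8_0-Q4_0.gguf \
    -f wiki.test.raw \
    --ctx-size 512
```
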
Final estimate: PPL = TODO

I may try to make a smaller one, e.g. `smol-IQ1_KT` or `smol-IQ2_KS` or similar, but I'm not sure how well it will go: the original is QAT'd with `compressed-tensors` into a format *very similar* to q4_0, using bf16 block scales instead of fp16 but the same 32 weights per block.
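
For scale: a `q4_0` block packs 32 weights at 4 bits each (16 bytes) plus one 2-byte fp16 scale, i.e. 18 bytes per 32 weights = 4.5 BPW; a bf16 scale is also 2 bytes, so the QAT format matches q4_0 in size and block layout and differs only in how the scale is encoded.
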
## References
* [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp)
* [Getting Started Guide (already out of date lol)](https://github.com/ikawrakow/ik_llama.cpp/discussions/258)
* [ubergarm-imatrix-calibration-corpus-v02.txt](https://gist.github.com/ubergarm/edfeb3ff9c6ec8b49e88cdf627b0711a?permalink_comment_id=5682584#gistcomment-5682584)
* [moonshotai/Kimi-K2-Thinking/discussions/2](https://huggingface.co/moonshotai/Kimi-K2-Thinking/discussions/2)
* [vllm-project/compressed-tensors/issues/511](https://github.com/vllm-project/compressed-tensors/issues/511)
* [llama.cpp PR#17069](https://github.com/ggml-org/llama.cpp/pull/17069#issuecomment-3500870165)