Imatrix from BF16?
Hi there, truly appreciate this open-source sharing — you’re my hero!
I would like to ask about the provided imatrix file: `imatrix-Kimi-K2-Thinking-Q8_0-Q4_0.dat`.
Is this based on the Q8_0/Q4_0, or was it calibrated from a BF16 version like unsloth's?
Would it be compatible with making personalized quantization from BF16?
From what I understand full BF16 is not possible yet due to limitations explained here: https://github.com/ggml-org/llama.cpp/pull/17069#issuecomment-3502582056
@ubergarm, please correct me if I'm wrong.
> Would it be compatible with making personalized quantization from BF16?
Absolutely! Yes, you can use the provided imatrix .dat file to make your own custom quants. It has importance information for all the tensors, so you can mix recipes as you please.
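A minimal sketch with mainline llama.cpp's `llama-quantize` (the GGUF file names and the `Q4_K_M` target are just placeholders, not a recommended recipe):

```bash
# Requantize a higher-precision GGUF into a custom type, guided by the
# provided importance matrix (input/output file names are placeholders).
./llama-quantize \
    --imatrix imatrix-Kimi-K2-Thinking-Q8_0-Q4_0.dat \
    Kimi-K2-Thinking-BF16.gguf \
    Kimi-K2-Thinking-Q4_K_M.gguf \
    Q4_K_M
```

To actually "mix recipes" you can layer per-tensor overrides on top of that (e.g. `llama-quantize`'s `--output-tensor-type` / `--token-embedding-type` flags), or use ik_llama.cpp's custom recipe support.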
> Is this based on the Q8_0/Q4_0?
Correct. I used PR17069 to create a full bf16 from the original moonshotai safetensors, then converted that, without imatrix, to q8_0-q4_0, approximating the original QAT design. To be clear, it would take a single machine with ~2TB of RAM to run inference on the full bf16 directly, so it is common to collect the imatrix data against a non-imatrix q8_0 (or smaller in some cases). Finally, I used that imatrix dat to create the rest of the quants.
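For anyone wanting to reproduce something similar, the steps look roughly like this (a simplified sketch with placeholder paths and file names, not my exact commands; in my case step 2 was the mixed q8_0/q4_0 rather than a plain Q8_0):

```bash
# 1) original safetensors -> full BF16 GGUF (needs the PR17069 conversion changes)
python convert_hf_to_gguf.py /path/to/Kimi-K2-Thinking \
    --outtype bf16 --outfile Kimi-K2-Thinking-BF16.gguf

# 2) BF16 -> a plain non-imatrix quant that is actually small enough to run
./llama-quantize Kimi-K2-Thinking-BF16.gguf Kimi-K2-Thinking-Q8_0.gguf Q8_0

# 3) collect the importance matrix against a calibration corpus
./llama-imatrix -m Kimi-K2-Thinking-Q8_0.gguf -f calibration.txt -o imatrix.dat

# 4) use the imatrix when making the final quants from the BF16
./llama-quantize --imatrix imatrix.dat \
    Kimi-K2-Thinking-BF16.gguf Kimi-K2-Thinking-Q4_0.gguf Q4_0
```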
> From what I understand full BF16 is not possible yet due to limitations
Right, my impression is folks are still exploring this. I have a "full bf16", unsloth has one, and @DevQuasar has another method. To be clear, no method can perfectly recreate the original compressed-tensors quantization configuration, given that GGUF q4_0 uses float16 block scales rather than the bfloat16 used by the original.
So I don't think any of the methods are completely lossless. They are all likely pretty close, with some small numerical/rounding/quantization-lattice type differences. But whether that is significant enough to change the output noticeably from the original model is unclear.
Some folks on the BeaverAI Discord are talking about trying to use vLLM to measure the differences between the original safetensors and the various GGUFs, but it's not easy to do, especially on models this large.
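For GGUF-to-GGUF comparisons on the llama.cpp side, the usual quick check is perplexity over the same text file for each quant (not necessarily how the numbers below were produced), something like:

```bash
# Run the same eval text through each quant and compare the final PPL numbers.
./llama-perplexity -m Kimi-K2-Thinking-Q8_0.gguf -f wiki.test.raw
./llama-perplexity -m Kimi-K2-Thinking-Q4_0.gguf -f wiki.test.raw
```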
In my own benchmarking, shown in the graph on the model card, the Q8_0 is doing better, which raises some questions. This suggests a few possibilities to me:
- the specific `compressed-tensors` format quantization targeted during the QAT is not translating well into llama.cpp's `Q4_0`
- moonshot really didn't post-train the QAT long enough to matter, so just throw more bits/better quants at it and ignore the QAT aspect
- round-tripping from original quantized safetensors -> BF16 GGUF -> quantized GGUF isn't letting the Q4_0 map correctly
- I'm not 100% sure how the activations were handled during QAT, and I think activations are ~8.5 bpw (basically quantized as q8_0) on the llama.cpp side, which could be affecting the perplexity scores as measured??
Curious to see what folks discover with some more time!