Q4_0=INT4?
Is INT4->BF16->Q4_0 really a lossless conversion?
Probably depends on the exact values, implementation details, and how you measure "lossless".
I know folks are using a few different ways to create these new Kimi-K2-Thinking-GGUFs. Then on top of that I've seen some recipes using q4_K which might map a bit differently than q4_0 for example.
Given the original model was QAT'd with a quantization format of 32 4-bit values per bfloat16 scale, and q4_0 is 32 4-bit values per float16 scale, this is gonna be an interesting experiment. I'm choosing q4_0 for the routed experts on my better recipes with the idea that if it is not lossless, it is probably about the closest available to the original QAT target quantization.
I did print out and spot check a handful of the original bfloat16 scale values, which all seemed to be very small numbers under 1.0, so hopefully there isn't any overflow/clipping. Assuming the implementations use larger dtypes for the intermediate calculations, it should be pretty close.
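For what it's worth, here is a tiny numpy sketch of the check I have in mind (not from any actual conversion script): emulate bfloat16 by truncating float32, cast the scale to float16 the way a q4_0 block would store it, and see if anything moves. The scale values below are made up.

```python
# Rough sanity check: do bf16 block scales survive being stored in q4_0's f16
# scale slot? numpy has no bfloat16 dtype, so bf16 is emulated here by zeroing
# the low 16 bits of a float32 (truncation rather than round-to-nearest).
import numpy as np

def to_bf16(x):
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

# Made-up per-block scales in the "well under 1.0" range I spot checked.
scales_bf16 = to_bf16(np.array([0.0123, 0.0042, 0.25, 0.8875], dtype=np.float32))

scales_f16 = scales_bf16.astype(np.float16)   # what a q4_0 block would store
roundtrip  = scales_f16.astype(np.float32)

# bf16 carries fewer mantissa bits than f16, so anything inside f16's normal
# range (roughly 6.1e-5 to 65504) should convert exactly; only very tiny or
# very large scales could lose bits or clip.
print("exact round trip:", np.array_equal(roundtrip, scales_bf16))
print("max abs diff    :", np.abs(roundtrip - scales_bf16).max())
```

If the real scales all come back exact, then the scale storage itself shouldn't be where any loss comes from.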
Also I have some perplexity numbers coming in slowly and hope to compare a few other quantizers' recipes so we can have some kind of data to look at, for what that is worth.
Would Q4_K be better or just skewed in different way?
So assuming the original bfloat16 scales fit well enough into the q4_0 float16 scales, I imagine q4_K would be slightly worse given the values will be "skewed in a different way" than the original QAT training. So the assumption here is q4_0 "fits" the original values better.
Here is some reading for you on ik's q4_K implementation in mainline llama.cpp for more details:
GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
https://github.com/ggml-org/llama.cpp/pull/1684
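Just to put numbers on the sizes being thrown around, here is the back-of-the-envelope bpw math from the block layouts as I understand them (one f16 scale per 32 weights for q4_0, an extra f16 min for q4_1, and the 256-weight super-block with 6-bit scales/mins for q4_K); treat the byte counts as my reading of the structs rather than gospel:

```python
# Bits-per-weight from the (assumed) block layouts; no GGUF parsing, just arithmetic.
def bpw(bytes_per_block: int, weights_per_block: int) -> float:
    return bytes_per_block * 8 / weights_per_block

print("Q4_0:", bpw(2 + 16, 32))             # f16 scale + 16 bytes of nibbles      -> 4.5
print("Q4_1:", bpw(2 + 2 + 16, 32))         # f16 scale + f16 min + nibbles        -> 5.0
print("Q4_K:", bpw(2 + 2 + 12 + 128, 256))  # f16 d + f16 dmin + 12 bytes of 6-bit
                                            # scales/mins + 128 bytes of nibbles   -> 4.5
```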
But honestly it could be "close enough" that it won't be easily measurable in actual use. Perplexity and KLD could probably give us some rough idea at least.
Some more in-depth technical discussion in this thread over here: https://github.com/ggml-org/llama.cpp/pull/17064#issuecomment-3503272247
I thought there was a way to do a direct conversion with llama.cpp without going to BF16, or is it limited to FP8?
@ubergarm Re Q4_0 - maybe Q4_1 might be slightly better to adjust for BF16 vs F16? I.e. an extra scaling factor.
There is a direct fp8 safetensors to BF16 GGUF method, yes. But there has been a recent increase in huggingface safetensors uploaded using the compressed-tensors vLLM-related project, which allows for arbitrary quantization configurations that may or may not be exactly supported in llama.cpp. So this will be an ongoing issue into the future as original safetensors are released with various pre-quantized mixes.
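If you want to check what a given release was pre-quantized with before picking a conversion path, the usual place to look is the quantization_config block in config.json. A hedged sketch (I'm assuming the common huggingface key names here, not anything specific to this checkpoint):

```python
# Peek at a downloaded checkpoint's quantization_config to see whether it is
# plain fp8 or a compressed-tensors mix. Key names follow the common HF
# convention and may differ per model.
import json

with open("config.json") as f:
    cfg = json.load(f)

qc = cfg.get("quantization_config", {})
print("quant_method:", qc.get("quant_method"))   # e.g. "fp8" or "compressed-tensors"
print(json.dumps(qc, indent=2))
```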
@danielhanchen Right, Q4_1 is 5bpw given it adds another fp16 offset in addition to the existing fp16 scale per block. I don't think that would help in this case, as the potential issue is that the original bf16 scale must be "downcast" into f16, with potential clipping, to fit exactly into q4_0, and that would still be true even with the extra offset.
Now that I have some perplexity data coming in, I'm kinda surprised how much better the pure Q8_0 scored than the supposed "QAT correct" mix of Q8_0-Q4_0. I also measured a couple of unsloth quants, and @danielhanchen's UD-Q4_K_XL scores slightly better perplexity than the Q8_0-Q4_0 mix at about the same size (just over 4.5bpw) despite using the "wrong" quantization types for the QAT haha...
This could be related to how they created their bf16, and I haven't compared a Q8_0 from their BF16 to my Q8_0 from my BF16 as that would take a while.
Though as shown by the worse unsloth Q2_K performance, it seems like ik's quants are doing better at the lower BPW regardless of any QAT business...
I have a hunch that round-tripping to bf16 and then back to the slightly different q4_0 format loses the benefits of the QAT (and perhaps damages the model more by quantizing it a second time), but this is based on intuition alone and not any kind of evidence.
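To make that hunch a bit more concrete, here is a toy numpy round-trip. It assumes the QAT format is literally "int4 codes in [-8, 7] times one bf16 scale per 32 weights" and uses a simplified q4_0-style requantization (max-magnitude value pinned to -8, f16 scale), which is not the actual llama.cpp reference code. The point is just that whether the second quantization lands back on the same grid depends on the block contents:

```python
import numpy as np

def to_bf16(x):
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

def requant_q4_0_like(w):
    """Simplified q4_0-style block quantization: the max-magnitude weight is
    mapped to -8 and the scale is stored as float16 (NOT the llama.cpp ref)."""
    m = w[np.argmax(np.abs(w))]
    d = np.float16(m / -8.0)
    q = np.clip(np.round(w / np.float32(d)), -8, 7)
    return q, d

s = to_bf16(np.array([0.0173], dtype=np.float32))    # made-up bf16 QAT scale

# Block A's extreme code is -8, so the requantization scale lands exactly on s.
# Block B's extreme code is +7, so the grid gets re-stretched and values move.
blocks = {
    "A": np.array([-8, -3, 0, 2, 7] + [1] * 27, dtype=np.float32),
    "B": np.array([-6, -3, 0, 2, 7] + [1] * 27, dtype=np.float32),
}

for name, q in blocks.items():
    w_bf16 = to_bf16(q * s)                          # dequantize to bf16 first
    q_new, d = requant_q4_0_like(w_bf16)
    err = np.abs(q_new * np.float32(d) - q * s).max()
    print(f"block {name}: max abs error vs original QAT values = {err:.3e}")
```

Under these made-up assumptions block A comes back exact while block B does not, so the round-trip is only "free" when a block's most extreme value already sits on the -8 end of the grid.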
Would then Q5_K be better until llama.cpp makes a proper conversion script?