IQ2_KS
Thanks for these quants!
Are you doing an IQ2_KS for this one (equivalent to https://huggingface.co/ubergarm/Kimi-K2-Instruct-GGUF/tree/main/IQ2_KS)?
Thanks! It fits (smaller than the IQ2_KS).
```
[print_timings] prompt eval time     =  35271.83 ms / 5656 tokens (  6.24 ms per token, 160.35 tokens per second)
[print_timings] generation eval time = 191089.56 ms / 2945 runs   ( 64.89 ms per token,  15.41 tokens per second)
```
Did the -ooae flag get removed from ik_llama recently?
This is really good. The logic doesn't break down even at > 12k context and it seems to remember details from earlier in the chat.
It also doesn't just shift over to "You're absolutely right, I'm sorry" when I push back during problem solving.
Why is this one so good? Is the quant a lot better this time around (vs the IQ2_KS for the other Kimi models), or is the model just a lot smarter?
> Did the -ooae flag get removed from ik_llama recently?
I can't find the exact PR, but ooae is now the default and can be disabled with `--no-ooae`, I believe.
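In launch-command terms it's roughly the difference below (the model path and everything besides the ooae flags are placeholders, not something from this thread):

```bash
# older ik_llama.cpp builds: the optimization had to be requested explicitly
./llama-server -m /models/Kimi-K2.gguf -ooae

# newer builds: it's on by default, so the flag can simply be dropped,
# and --no-ooae turns it back off if you want the old behaviour
./llama-server -m /models/Kimi-K2.gguf
./llama-server -m /models/Kimi-K2.gguf --no-ooae
```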
> Why is this one so good? Is the quant a lot better this time around (vs the IQ2_KS for the other Kimi models), or is the model just a lot smarter?
Why not both? Haha... honestly I'm not sure; I'm still exploring how well the QAT actually translated over to the various GGUF quantization types in terms of relative perplexity.
Thanks for some of your comments in other discussions about `--special` and needing to fix up the `<|im_end|>` stop token!
> Thanks for some of your comments in other discussions about `--special` and needing to fix up the `<|im_end|>` stop token!
No problem, that one burned me in February when I distilled R1 -> Mistral-Large and it wouldn't print the special tokens in llama.cpp.
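For anyone who hits the same thing, this is roughly the kind of workaround meant here (the token id below is a placeholder; the real `<|im_end|>` id depends on the model's vocab):

```bash
# --special makes llama-cli / llama-server emit special tokens like <|im_end|>
# instead of silently dropping them from the output
./llama-server -m /models/model.gguf --special

# if the GGUF metadata points at the wrong end-of-turn token, it can be
# overridden at load time instead of re-quantizing (placeholder id below)
./llama-server -m /models/model.gguf \
  --override-kv tokenizer.ggml.eos_token_id=int:12345
```

Client-side, adding "<|im_end|>" to the request's stop strings works too, but fixing it at load time means every client benefits.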
> now the default
Thanks, yeah, I noticed a lot of things are on by default now; my scripts are a lot smaller lol