Feedback on smol-IQ1_KT
Hi again,
I'm sharing performance numbers for the 1-bit quant. The specs are below; I'm leaving them in for others considering a similar setup:
- Ryzen 9 9900X
- Asus X870E Creator
- 192 GB @ 6000 MT/s (~70 GB/s of bandwidth)
- RTX 5090 + 2x RTX 3090 + RTX 4070 Ti Super
Here are the numbers:
```
./ik_llama.cpp/build/bin/llama-sweep-bench \
--model /home/llm_models/ling-1T/smol-IQ1_KT/Ling-1T-smol-IQ1_KT-00001-of-00005.gguf \
--ctx-size 32768 \
-fa -fmoe -ger \
-ctk q8_0 -ctv q8_0 \
-ub 4096 -b 4096 \
-ngl 99 \
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10)\.ffn_.*=CUDA0" \
-ot "blk\.(11|12|13)\.ffn_.*=CUDA1" \
-ot "blk\.(14|15|16|17|18|19)\.ffn_.*=CUDA2" \
-ot "blk\.(20|21|22|23|24)\.ffn_.*=CUDA3" \
-ot exps=CPU \
--no-mmap \
--threads 11 \
--parallel 1
```
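For anyone puzzling over the `-ot` flags: each one is a regex override that pins matching tensors to a device, with the final `exps=CPU` acting as a catch-all that keeps the remaining expert tensors in system RAM. A minimal sketch of how the placement works out (assuming overrides are tried in order with first match winning; `placement` is just an illustrative helper, not an actual ik_llama.cpp function):

```python
import re

# The override regexes from the command above, in order.
# A tensor goes to the device of the first pattern that matches its name.
patterns = [
    (re.compile(r"blk\.(0|1|2|3|4|5|6|7|8|9|10)\.ffn_.*"), "CUDA0"),
    (re.compile(r"blk\.(11|12|13)\.ffn_.*"), "CUDA1"),
    (re.compile(r"blk\.(14|15|16|17|18|19)\.ffn_.*"), "CUDA2"),
    (re.compile(r"blk\.(20|21|22|23|24)\.ffn_.*"), "CUDA3"),
    (re.compile(r"exps"), "CPU"),  # catch-all for the remaining expert tensors
]

def placement(tensor_name: str) -> str:
    """Return the device the first matching override assigns, else 'default'."""
    for pat, device in patterns:
        if pat.search(tensor_name):
            return device
    return "default"

print(placement("blk.12.ffn_gate_exps.weight"))  # CUDA1
print(placement("blk.30.ffn_down_exps.weight"))  # CPU (only the exps catch-all matches)
```

Note that `blk\.(0|...|10)\.ffn_` does not accidentally swallow `blk.12...`, because the `\.` after the alternation requires the layer number to end there.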
main: n_kv_max = 32768, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 11, n_threads_batch = 11
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 1024 | 0 | 17.148 | 238.86 | 210.602 | 4.86 |
| 4096 | 1024 | 4096 | 16.660 | 245.86 | 215.155 | 4.76 |
| 4096 | 1024 | 8192 | 17.413 | 235.22 | 218.407 | 4.69 |
| 4096 | 1024 | 12288 | 16.972 | 241.34 | 221.125 | 4.63 |
| 4096 | 1024 | 16384 | 17.413 | 235.23 | 224.040 | 4.57 |
| 4096 | 1024 | 20480 | 17.712 | 231.25 | 228.519 | 4.48 |
| 4096 | 1024 | 24576 | 17.403 | 235.36 | 231.892 | 4.42 |
| 4096 | 1024 | 28672 | 17.337 | 236.25 | 235.761 | 4.34 |
Prompt processing is fast, text generation is so-so. Kimi K2 is faster overall, in both PP and especially TG, even though it's roughly the same size.
As for the quants and the model: good news! The 1-bit quant is definitely usable for conversation and roleplay; it's able to explain its reasoning and motivations, which is quite a feat given it's a 1-bit quant. The model, however, is very confrontational. It has a preference for purple prose and melodrama, and likes to make every character edgy. It feels like DeepSeek V3 0324 but more untamed. I don't know if that's a consequence of the low-bit quantization: I managed to load the smallest Q2 quant, but it swapped to disk and was very slow. I had a couple of exchanges with it, and it definitely felt better than the 1-bit version.
So overall I prefer Kimi K2 or DeepSeek style-wise, but this is a very good model anyway, just not my cup of tea.
Thanks again for the quants @ubergarm !