Feedback on smol-IQ1_KT
Hi again,
I'm sharing performance numbers for the 1-bit quant. The specs are below; I'm leaving them in for others considering a similar setup:
- Ryzen 9 9900X
- Asus X870E Creator
- 192 GB @ 6000 MT/s (~70 GB/s of bandwidth)
- RTX 5090 + 2x RTX 3090 + RTX 4070 Ti Super
Here are the numbers:
```
./ik_llama.cpp/build/bin/llama-sweep-bench \
--model /home/llm_models/ling-1T/smol-IQ1_KT/Ling-1T-smol-IQ1_KT-00001-of-00005.gguf \
--ctx-size 32768 \
-fa -fmoe -ger \
-ctk q8_0 -ctv q8_0 \
-ub 4096 -b 4096 \
-ngl 99 \
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10)\.ffn_.*=CUDA0" \
-ot "blk\.(11|12|13)\.ffn_.*=CUDA1" \
-ot "blk\.(14|15|16|17|18|19)\.ffn_.*=CUDA2" \
-ot "blk\.(20|21|22|23|24)\.ffn_.*=CUDA3" \
-ot exps=CPU \
--no-mmap \
--threads 11 \
--parallel 1
```
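For anyone puzzling over the `-ot` flags: each one is a regex override that pins matching tensors to a device, with the final `exps=CPU` acting as a catch-all that keeps the remaining expert tensors in system RAM. A minimal sketch of how the placement works out (assuming overrides are tried in order with first match winning; `placement` is just an illustrative helper, not an actual ik_llama.cpp function):

```python
import re

# The override regexes from the command above, in order.
# A tensor goes to the device of the first pattern that matches its name.
patterns = [
    (re.compile(r"blk\.(0|1|2|3|4|5|6|7|8|9|10)\.ffn_.*"), "CUDA0"),
    (re.compile(r"blk\.(11|12|13)\.ffn_.*"), "CUDA1"),
    (re.compile(r"blk\.(14|15|16|17|18|19)\.ffn_.*"), "CUDA2"),
    (re.compile(r"blk\.(20|21|22|23|24)\.ffn_.*"), "CUDA3"),
    (re.compile(r"exps"), "CPU"),  # catch-all for the remaining expert tensors
]

def placement(tensor_name: str) -> str:
    """Return the device the first matching override assigns, else 'default'."""
    for pat, device in patterns:
        if pat.search(tensor_name):
            return device
    return "default"

print(placement("blk.12.ffn_gate_exps.weight"))  # CUDA1
print(placement("blk.30.ffn_down_exps.weight"))  # CPU (only the exps catch-all matches)
```

Note that `blk\.(0|...|10)\.ffn_` does not accidentally swallow `blk.12...`, because the `\.` after the alternation requires the layer number to end there.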
main: n_kv_max = 32768, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 11, n_threads_batch = 11
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 1024 | 0 | 17.148 | 238.86 | 210.602 | 4.86 |
| 4096 | 1024 | 4096 | 16.660 | 245.86 | 215.155 | 4.76 |
| 4096 | 1024 | 8192 | 17.413 | 235.22 | 218.407 | 4.69 |
| 4096 | 1024 | 12288 | 16.972 | 241.34 | 221.125 | 4.63 |
| 4096 | 1024 | 16384 | 17.413 | 235.23 | 224.040 | 4.57 |
| 4096 | 1024 | 20480 | 17.712 | 231.25 | 228.519 | 4.48 |
| 4096 | 1024 | 24576 | 17.403 | 235.36 | 231.892 | 4.42 |
| 4096 | 1024 | 28672 | 17.337 | 236.25 | 235.761 | 4.34 |
Prompt processing is fast, text generation is so-so. Kimi K2 is faster overall, in both PP and especially TG, even though it's roughly the same size.
As for the quants and the model: good news! The 1-bit quant is definitely usable for conversation and roleplay; it's able to explain its reasoning and motivations, which is quite a feat given it's a 1-bit quant. The model, however, is very confrontational. It has a preference for purple prose and melodrama, and likes to make every character edgy. It feels like DeepSeek V3 0324 but more untamed. I don't know if that's a consequence of the low-bit quantization: I managed to load the smallest Q2 quant, but it swapped to disk and was very slow. I had a couple of exchanges with it, and it definitely felt better than the 1-bit version.
So overall I prefer Kimi K2 or DeepSeek style-wise, but this is a very good model anyway, just not my cup of tea.
Thanks again for the quants @ubergarm !