Q4_X perplexity?

#13
by Fernanda24 - opened

Hi, I can help run perplexity scores as well. Are there instructions anywhere on how to run them? Happy to share results :)

Here is a summary of my most recent workflow for calculating perplexity, for reference, also taken from this recent post: https://github.com/ikawrakow/ik_llama.cpp/issues/942#issuecomment-3536933398


Can you please share your command to measure perplexity?

Sure, here it is again. Right, I always use the default 512 context and an unquantized f16 KV cache for my published numbers in the charts. And yes, the usual wiki.test.raw file.

$ wget https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp/resolve/main/wiki.test.raw.gz
$ gunzip wiki.test.raw.gz
$ ls -lah wiki.test.raw
-rw-rw-r-- 1 w w 1.3M Mar  5  2025 wiki.test.raw
$ sha1sum wiki.test.raw
6f1fe2054a940eebfc76b284b09680763b37f5ea  wiki.test.raw

$ numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-perplexity \
    -m "$model" \
    -f wiki.test.raw \
    --seed 1337 \
    -mla 3 \
    --ctx-size 512 \
    -ub 4096 -b 4096 \
    --numa numactl \
    --threads 96 \
    --threads-batch 128 \
    --no-mmap

The seed does nothing here, it is just for fun. I don't think you need -mla 3 anymore as that is the default now. I specify context just to be explicit, but 512 is the default value. You can adjust batch size as needed for your rig (generally I avoid going over 4096) and it doesn't affect results. Of course adjust threads, offload, and others as desired.
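For example, a hybrid CPU+GPU run might look something like this (a rough sketch, not my exact command; the -ngl value, the exps=CPU override, and the thread count are placeholders to adapt to your own rig):

# -ngl 99 offloads whatever it can to the GPU(s), while
# -ot exps=CPU keeps the big routed-expert tensors in system RAM
$ ./build/bin/llama-perplexity \
    -m "$model" \
    -f wiki.test.raw \
    --ctx-size 512 \
    -ub 4096 -b 4096 \
    -ngl 99 \
    -ot exps=CPU \
    --threads 24 \
    --no-mmap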

@ubergarm I'm wondering, what is the effect of increasing batch size without increasing ubatch size? Would that increase perf and/or VRAM usage?

@Panchovix

I honestly don't know for all situations; you can try it and see what the debug logs print out for all the various buffers. The defaults are -ub 512 -b 2048 fwiw. I am generally only concerned with -ub, and if I ever set ub above 2048 I just set batch size -b to the same value so it is greater than or equal to ub. I am not aware of any benefit to further increasing batch size without increasing ubatch size.

This might be useful reading specific to MLA models, discussing the effects of changing micro-batch (ubatch) sizes on compute buffers before ik added -amb (which is another good closed PR to read): https://github.com/ikawrakow/ik_llama.cpp/pull/235#issuecomment-2688652721
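If you do raise the micro-batch, the shape of it is something like this (just a sketch: keep -b at least as large as -ub, and the -amb 512 here is an illustrative value, as I understand it a cap in MiB on the MLA attention compute buffer):

# defaults are -ub 512 -b 2048
$ ./build/bin/llama-perplexity -m "$model" -f wiki.test.raw \
    -ub 2048 -b 2048 \
    -amb 512
# -amb is optional; drop it if you have compute-buffer VRAM to spare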

I usually do -b 2048 and -ub 4096. It seems to max out my system with a lower footprint than -b 4096 (they give me the same pp and tg). I don't have scientific results, but after lots of experimentation that's what seems to work best on my system. OK, so unsloth IQ3_XXS got PPL = 2.3270 +/- 0.01046

PS: file size from running du -h on the unsloth IQ3_XXS is 393G

unsloth ud-q3_k_xl at size 424G (from "du -h") gets PPL = 2.2257 +/- 0.00993

unsloth q4_k_m @ 579G PPL = 2.1196 +/- 0.00925

ok Q4_X coming up next

ubergarm Q4_X @ 544G: PPL = 2.0818 +/- 0.00903

stoked!!!!

For reporting size I look at the "model size" printed out at the beginning of llama-server or llama-perplexity, e.g. for that UD-Q3_K_XL you mention:

llm_load_print_meta: model type       = 671B
llm_load_print_meta: model ftype      = Q3_K - Medium
llm_load_print_meta: model params     = 1.026 T
llm_load_print_meta: model size       = 423.853 GiB (3.547 BPW)  <--- this is what I use for my graphs
llm_load_print_meta: repeating layers = 422.340 GiB (3.543 BPW, 1024.059 B parameters)
llm_load_print_meta: general.name     = Kimi-K2-Thinking

Keep in mind GiB is not the same as GB, and I try to report them correctly in the graphs, which confuses a lot of people.

I believe du -h reports in GiB, but it might be the "logical size" depending on your disk's formatting; it's probably pretty close for a few large files.
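If you want to sanity check those numbers yourself it is just arithmetic, e.g. for that UD-Q3_K_XL line above (assuming you have bc installed; the ~3.55 comes out a hair above the logged 3.547 BPW only because the 1.026T param count is rounded):

# GiB -> GB: multiply by 1024^3 / 10^9
$ echo "423.853 * 1024^3 / 10^9" | bc -l                # ~455.1 GB
# bits per weight: bytes * 8 / parameter count
$ echo "423.853 * 1024^3 * 8 / 1026000000000" | bc -l   # ~3.55 BPW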

Here are my results as shown in the other discussion. Glad your measurement matches mine for the Q4_X, which is the "full size" model adjusted into Q4_0 as well as possible going from the original compressed-tensors format into llama.cpp GGUF.

[chart: ppl-Kimi-K2-Thinking perplexity results]

Note Unsloth's Q2_K does not use an imatrix, so it is higher up. I've changed a couple of names a bit for the graph on the model card.

Smol IQ3_KS is not released, right? It seems it has a good PPL for its size.

@Panchovix @Fernanda24

Wait a sec, I thought I had uploaded the smol-IQ3_KS already... Fernanda24 was asking me about it too over here: https://huggingface.co/ubergarm/Kimi-K2-Thinking-GGUF/discussions/14

I'll look at it again more closely and likely upload it or another slight variation tonight. Thanks for helping me keep my head on straight, so many quants and numbers haha...

Okay, I slightly modified the smol-IQ3_KS recipe given it had been using a few layers of the un-patched q4_0. I tried it with q4_x, but it was using an imatrix so there was no difference. I think the very, very slight increase in perplexity isn't bad for it being quite a bit smaller:

[chart: ppl-Kimi-K2-Thinking perplexity results, updated]

Gonna split it and upload it and update the graphs, so the v3-smol-IQ3_KS is the final version. The v1 was never uploaded 😅

Those look pretty good! If ik_llama.cpp has RPC I could make it work. But if not, maybe with 16K ctx and Q8 cache using partial layers haha.

@Panchovix

you can always go q6_0 kv-cache haha... πŸ˜…
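If you do try it, the flags are something like this (a sketch; I believe the quantized V cache wants flash attention enabled):

$ ./build/bin/llama-server \
    -m "$model" \
    -fa \
    -ctk q6_0 -ctv q6_0 \
    --ctx-size 16384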

I thought ik had a basic RPC backend, but I'm not sure it has been updated with the latest changes; that is a good question.

I get pretty good results on RPC, at least on GLM 4.6 i.e. https://github.com/ggml-org/llama.cpp/discussions/16625#discussioncomment-14711806

If it does get the latest changes it would be great!

@Panchovix

Interesting, it sounds like RPC is good enough to use in some cases. I haven't looked into how difficult it would be to port the updated RPC features from mainline to ik. I think there are 3 or 4 main RPC PRs closed on mainline beginning in early October that improved performance, starting with this one maybe: https://github.com/ggml-org/llama.cpp/pull/16276

Feel free to open an issue on IK's fork and mention it is good enough that you would use it if it were available on ik_llama.cpp.

@Panchovix @ubergarm I tried Q4_X with RPC. You seem to have to run one RPC instance per device (I used screen for this). Anyway, I offloaded about 130GiB via RPC and the bandwidth for token gen seems to be around 35% saturation of 10GbE Ethernet, so 1GbE isn't enough for RPC with these big models. I got massive improvements from connecting an Ethernet cable directly between the two machines rather than hopping over switches; latency improved 10x, so that is probably most of it. Got about 62 t/s for pp (3733 token prompt) and 8 t/s for tg (generating 740 tokens), using a mix of mostly P40s and a couple of 3090s on the RPC server. P40s are still faster than DDR4, but I'm not sure about DDR5, which I don't have; they have about 336 GB/s bandwidth I believe. Oh, this was with llama.cpp, not ik_llama. Still have to try ik_llama and experiment more.

Edit: running Q4_X just on the main server on ik_llama with mixed CPU/GPU gets similar perf as llama.cpp with RPC to more GPUs, so ik_llama + an RPC upgrade might be even faster.
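For anyone else trying it, the rough shape of the setup was one rpc-server per GPU on the remote box (pinned with CUDA_VISIBLE_DEVICES), then pointing the main instance at them with --rpc. The addresses and ports below are made up, adjust for your LAN, and note that -H 0.0.0.0 exposes the servers to the whole network, so only do that somewhere trusted:

# on the RPC box: one server per GPU, each on its own port
$ CUDA_VISIBLE_DEVICES=0 ./build/bin/rpc-server -H 0.0.0.0 -p 50052
$ CUDA_VISIBLE_DEVICES=1 ./build/bin/rpc-server -H 0.0.0.0 -p 50053

# on the main box: list the endpoints
$ ./build/bin/llama-server -m "$model" \
    --rpc 192.168.1.50:50052,192.168.1.50:50053 \
    -ngl 99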
