Benchmarks (Q8_0-Q4_0, RTX 5090, EPYC)

#5 opened by sousekd

Here are benchmarks of Q8_0-Q4_0 running on a single RTX 5090 and an EPYC 9355:

128K context @ f16

./llama-sweep-bench \
    --model "$MODEL_PATH" \
    --no-mmap \
    -mla 3 \
    -amb 512 -b 8192 -ub 8192 \
    -ctk f16 -ctv f16 -c 131072 \
    -ngl 999 -ot exps=CPU \
    --threads 16 \
    --threads-batch 28 \
    --warmup-batch \
    -n 256
|   PP |  TG |   N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|----:|-------:|-------:|---------:|-------:|---------:|
| 8192 | 256 |      0 | 18.455 |   443.90 | 13.719 |    18.66 |
| 8192 | 256 |   8192 | 20.105 |   407.47 | 14.089 |    18.17 |
| 8192 | 256 |  16384 | 21.961 |   373.03 | 14.435 |    17.73 |
| 8192 | 256 |  24576 | 23.750 |   344.92 | 15.003 |    17.06 |
|  ... | ... |    ... |    ... |      ... |    ... |      ... |
| 8192 | 256 |  98304 | 42.663 |   192.01 | 16.907 |    15.14 |
| 8192 | 256 | 106496 | 44.635 |   183.53 | 17.095 |    14.97 |
| 8192 | 256 | 114688 | 46.665 |   175.55 | 17.559 |    14.58 |
| 8192 | 256 | 122880 | 48.700 |   168.21 | 17.604 |    14.54 |

192K context @ q8_0

./llama-sweep-bench \
    --model "$MODEL_PATH" \
    --no-mmap \
    -mla 3 \
    -amb 512 -b 4096 -ub 4096 \
    -ctk q8_0 -ctv q8_0 -c 196608 \
    -ngl 999 -ot exps=CPU \
    --threads 16 \
    --threads-batch 28 \
    --warmup-batch \
    -n 128
|   PP |  TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|----:|------:|-------:|---------:|-------:|---------:|
| 4096 | 128 |     0 | 15.120 |   270.90 |  6.918 |    18.50 |
| 4096 | 128 |  4096 | 15.601 |   262.54 |  7.105 |    18.02 |
| 4096 | 128 |  8192 | 16.057 |   255.09 |  7.150 |    17.90 |
| 4096 | 128 | 12288 | 16.468 |   248.72 |  7.353 |    17.41 |
|  ... | ... |   ... |    ... |      ... |    ... |      ... |

Thank you for the quants, @ubergarm.

After doing more perplexity measurements, I'm not sure q4_0 is the best choice despite fairly closely matching the original QAT target format... Needs more research...
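
For anyone who wants to reproduce the comparison, a perplexity run along these lines should work - a sketch only: "$MODEL_PATH" and the text corpus (wiki.test.raw here) are placeholders, and the offload/thread settings just mirror the benchmarks above:

# sketch: placeholder paths; same offload settings as the sweep-bench runs above
./llama-perplexity \
    --model "$MODEL_PATH" \
    -f wiki.test.raw \
    -c 512 \
    -ngl 999 -ot exps=CPU \
    --threads 16 --threads-batch 28

Running the same command against each quant and comparing the final PPL values is the comparison I mean.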

Any findings? 😀

> Any findings? 😀

@sousekd

Thanks for the benchmarks! I'm back at my desk today and going to do some more perplexity measurements, but you can see in the updated graph that the full q8_0 (q8_0 for everything) is scoring "better" than the q8_0-q4_0, which I wouldn't expect if the QAT were completely effective.

Some folks on the AI Beavers Discord want to measure the original with vLLM and try to get an apples-to-apples comparison with a GGUF, but that is tricky to do, especially at these sizes.
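
For the non-GGUF side, one option (just a sketch, not necessarily what they are doing) is the lm-evaluation-harness with its vLLM backend, where "$ORIGINAL_MODEL" and the parallelism settings are placeholders:

# sketch: evaluate the original safetensors model via the vLLM backend
lm_eval --model vllm \
    --model_args pretrained="$ORIGINAL_MODEL",tensor_parallel_size=2,gpu_memory_utilization=0.85 \
    --tasks wikitext \
    --batch_size auto

One caveat: the wikitext task reports word-level perplexity, while llama-perplexity reports token-level PPL over its own chunking, so the numbers are not directly comparable without extra care - part of what makes it tricky.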

The main thing I've seen is to run this with an updated chat template (given the original was patched a couple of times since release), as well as adding --special to get it to output <think> tags, but then you have to deal with the stop token on the client side...
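
The server invocation looks something like this - a sketch only; the template file is a placeholder, and --jinja/--chat-template-file follow mainline llama-server, so flag availability may differ on other forks:

# sketch: serve with a patched chat template and special-token output enabled
./llama-server \
    --model "$MODEL_PATH" \
    --jinja --chat-template-file ./updated-template.jinja \
    --special \
    -c 131072 -ngl 999 -ot exps=CPU

With --special the <think> tags show up in the output, but as noted you still have to handle the stop token on the client side.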

> I'm back at my desk today and going to do some more perplexity measurements, but you can see in the updated graph that the full q8_0 (q8_0 for everything) is scoring "better" than the q8_0-q4_0, which I wouldn't expect if the QAT were completely effective.

Interesting. I know nothing about the quantization process, but I can imagine that converting any floating point number can lead to some data loss, so direct conversion would be preferred...

> The main thing I've seen is to run this with an updated chat template [...]

...and the tool calling does not work yet - even on mainline llama.cpp.

> ...and the tool calling does not work yet - even on mainline llama.cpp.

Ohh really?? It seems like tool calling and MCP stuff can be so sensitive to exact implementation details and chat templates, which is why Kimi released an entire K2-Vendor-Verifier GitHub project just to score how well a given setup actually seems to be working...

> Ohh really??

Yes: https://github.com/ggml-org/llama.cpp/issues/17155

I wanted to post a feature request on ik_llama, but it probably makes sense to wait for the llama.cpp implementation first.
There are more models with broken tool calling on ik_llama still. I hope to find a bit of time soon to test them all vs mainline and report back...
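
For a quick manual check, a minimal tool-call request against the OpenAI-compatible /v1/chat/completions endpoint looks something like this (the port, model name and get_weather tool are made up for illustration):

# sketch: send one tool-call request to a locally running llama-server
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "local",
      "messages": [{"role": "user", "content": "What is the weather in Prague?"}],
      "tools": [{
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather for a city",
          "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
          }
        }
      }]
    }'

A working setup should respond with a tool_calls entry naming get_weather rather than a plain-text answer.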

It is interesting how tool calling does not seem to be that important for many.
Even vLLM still has serious issues with gpt-oss tool calling - and it is not a new model by any means.
I guess most people use local models just for RP 😀. And for coding, where the clients have their own implementations.
