Benchmarks (Q8_0-Q4_0, RTX 5090, EPYC)

#5 opened by sousekd

Here are benchmarks of Q8_0-Q4_0 running on a single RTX 5090 and an EPYC 9355:

128K context @ f16

./llama-sweep-bench \
    --model "$MODEL_PATH" \
    --no-mmap \
    -mla 3 \
    -amb 512 -b 8192 -ub 8192 \
    -ctk f16 -ctv f16 -c 131072 \
    -ngl 999 -ot exps=CPU \
    --threads 16 \
    --threads-batch 28 \
    --warmup-batch \
    -n 256
|   PP |  TG |   N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|----:|-------:|-------:|---------:|-------:|---------:|
| 8192 | 256 |      0 | 18.455 |   443.90 | 13.719 |    18.66 |
| 8192 | 256 |   8192 | 20.105 |   407.47 | 14.089 |    18.17 |
| 8192 | 256 |  16384 | 21.961 |   373.03 | 14.435 |    17.73 |
| 8192 | 256 |  24576 | 23.750 |   344.92 | 15.003 |    17.06 |
|  ... | ... |    ... |    ... |      ... |    ... |      ... |
| 8192 | 256 |  98304 | 42.663 |   192.01 | 16.907 |    15.14 |
| 8192 | 256 | 106496 | 44.635 |   183.53 | 17.095 |    14.97 |
| 8192 | 256 | 114688 | 46.665 |   175.55 | 17.559 |    14.58 |
| 8192 | 256 | 122880 | 48.700 |   168.21 | 17.604 |    14.54 |

192K context @ q8_0

./llama-sweep-bench \
    --model "$MODEL_PATH" \
    --no-mmap \
    -mla 3 \
    -amb 512 -b 4096 -ub 4096 \
    -ctk q8_0 -ctv q8_0 -c 196608 \
    -ngl 999 -ot exps=CPU \
    --threads 16 \
    --threads-batch 28 \
    --warmup-batch \
    -n 128
|   PP |  TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|----:|------:|-------:|---------:|-------:|---------:|
| 4096 | 128 |     0 | 15.120 |   270.90 |  6.918 |    18.50 |
| 4096 | 128 |  4096 | 15.601 |   262.54 |  7.105 |    18.02 |
| 4096 | 128 |  8192 | 16.057 |   255.09 |  7.150 |    17.90 |
| 4096 | 128 | 12288 | 16.468 |   248.72 |  7.353 |    17.41 |
|  ... | ... |   ... |    ... |      ... |    ... |      ... |

Thank you for the quants, @ubergarm.

After doing more perplexity measurements, I'm not sure q4_0 is the best choice despite fairly closely matching the original QAT target format... Needs more research...
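
For anyone who wants to reproduce the comparison, a perplexity run along these lines should work - a sketch only: "$MODEL_PATH" and the text corpus (wiki.test.raw here) are placeholders, and the offload/thread settings just mirror the benchmarks above:

# sketch: placeholder paths; same offload settings as the sweep-bench runs above
./llama-perplexity \
    --model "$MODEL_PATH" \
    -f wiki.test.raw \
    -c 512 \
    -ngl 999 -ot exps=CPU \
    --threads 16 --threads-batch 28

Running the same command against each quant and comparing the final PPL values is the comparison I mean.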

Any findings? 😀

> Any findings? 😀

@sousekd

Thanks for the benchmarks! I'm back at my desk today and going to do some more perplexity measurements, but you can see in the updated graph that the full q8_0 (q8_0 for everything) is scoring "better" than the q8_0-q4_0, which I wouldn't expect if the QAT were completely effective.

Some folks on the AI Beavers Discord want to measure the original with vLLM and try to get an apples-to-apples comparison with a GGUF, but that is tricky to do, especially at these sizes.
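
For the non-GGUF side, one option (just a sketch, not necessarily what they are doing) is the lm-evaluation-harness with its vLLM backend, where "$ORIGINAL_MODEL" and the parallelism settings are placeholders:

# sketch: evaluate the original safetensors model via the vLLM backend
lm_eval --model vllm \
    --model_args pretrained="$ORIGINAL_MODEL",tensor_parallel_size=2,gpu_memory_utilization=0.85 \
    --tasks wikitext \
    --batch_size auto

One caveat: the wikitext task reports word-level perplexity, while llama-perplexity reports token-level PPL over its own chunking, so the numbers are not directly comparable without extra care - part of what makes it tricky.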

The main thing I've seen is to run this with an updated chat template (given the original was patched a couple of times since release), as well as adding --special to get it to output <think> tags, but then you have to deal with the stop token on the client side...
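
The server invocation looks something like this - a sketch only; the template file is a placeholder, and --jinja/--chat-template-file follow mainline llama-server, so flag availability may differ on other forks:

# sketch: serve with a patched chat template and special-token output enabled
./llama-server \
    --model "$MODEL_PATH" \
    --jinja --chat-template-file ./updated-template.jinja \
    --special \
    -c 131072 -ngl 999 -ot exps=CPU

With --special the <think> tags show up in the output, but as noted you still have to handle the stop token on the client side.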

> I'm back at my desk today and going to do some more perplexity measurements, but you can see in the updated graph that the full q8_0 (q8_0 for everything) is scoring "better" than the q8_0-q4_0, which I wouldn't expect if the QAT were completely effective.

Interesting. I know nothing about the quantization process, but I can imagine that converting any floating point number can lead to some data loss, so direct conversion would be preferred...

> The main thing I've seen is to run this with an updated chat template [...]

...and the tool calling does not work yet - even on mainline llama.cpp.

> ...and the tool calling does not work yet - even on mainline llama.cpp.

Ohh really?? It seems like tool calling and MCP stuff can be so sensitive to exact implementation details and chat templates, which is why Kimi released an entire K2-Vendor-Verifier GitHub project just to score how well a given setup actually seems to be working...

> Ohh really??

Yes: https://github.com/ggml-org/llama.cpp/issues/17155

I wanted to post a feature request on ik_llama, but it probably makes sense to wait for the llama.cpp implementation first.
There are more models with broken tool calling on ik_llama still. I hope to find a bit of time soon to test them all vs mainline and report back...
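
For a quick manual check, a minimal tool-call request against the OpenAI-compatible /v1/chat/completions endpoint looks something like this (the port, model name and get_weather tool are made up for illustration):

# sketch: send one tool-call request to a locally running llama-server
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "local",
      "messages": [{"role": "user", "content": "What is the weather in Prague?"}],
      "tools": [{
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather for a city",
          "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
          }
        }
      }]
    }'

A working setup should respond with a tool_calls entry naming get_weather rather than a plain-text answer.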

It is interesting how tool calling does not seem to be that important for many.
Even vLLM still has serious issues with gpt-oss tool calling - and it is not a new model by any means.
I guess most people use local models just for RP 😀. And for coding, where the clients have their own implementations.
