IQ5_K benchmarks on AMD EPYC with a single GPU (RTX 5090/6000)

#7
by sousekd - opened

Here are some performance tests of the IQ5_K version running on EPYC 9355, RTX 5090, and RTX Pro 6000.
Sharing these benchmarks for anyone thinking about this kind of setup.

  • CPU power capped at 280 W, GPUs at 450 W
  • Running inside a Proxmox/QEMU VM, with only 32 real cores assigned (SMT cores reserved for host)
  • Testing various combinations of context size and batch size that fit in VRAM
```bash
# Flags: -fa enables flash attention; -fmoe fuses MoE ops (ik_llama.cpp);
# -ngl 999 offloads all layers to the GPU, while -ot exps=CPU overrides
# that for the routed-expert tensors, keeping them in system RAM;
# --no-mmap loads weights into RAM instead of memory-mapping the file.
# The model path (-m) and the per-run context/batch flags (-c, -ub)
# are omitted here and vary between the runs below.
./llama-sweep-bench \
    -fa -fmoe \
    -ngl 999 \
    -ot exps=CPU \
    --no-mmap \
    --threads 16 \
    --threads-batch 28 \
    --warmup-batch
```

In the tables below, PP and TG are the prompt and generation tokens per step, N_KV is the KV-cache fill at the start of the step, and S_PP/S_TG are the corresponding speeds in tokens per second.

RTX 5090 - 32k @ f16 (-ub 2048)

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 512 | 0 | 7.465 | 274.33 | 29.334 | 17.45 |
| 2048 | 512 | 2048 | 7.546 | 271.39 | 29.636 | 17.28 |
| 2048 | 512 | 4096 | 7.629 | 268.45 | 29.921 | 17.11 |
| 2048 | 512 | 6144 | 7.715 | 265.45 | 30.586 | 16.74 |
| 2048 | 512 | 8192 | 7.808 | 262.29 | 30.479 | 16.80 |
| 2048 | 512 | 10240 | 7.930 | 258.27 | 30.871 | 16.58 |
| 2048 | 512 | 12288 | 7.981 | 256.62 | 31.220 | 16.40 |
| 2048 | 512 | 14336 | 8.061 | 254.07 | 31.598 | 16.20 |
| 2048 | 512 | 16384 | 8.172 | 250.61 | 31.927 | 16.04 |
| 2048 | 512 | 18432 | 8.266 | 247.78 | 32.196 | 15.90 |
| 2048 | 512 | 20480 | 8.101 | 252.82 | 32.558 | 15.73 |
| 2048 | 512 | 22528 | 8.251 | 248.23 | 32.963 | 15.53 |
| 2048 | 512 | 24576 | 8.305 | 246.59 | 34.211 | 14.97 |
| 2048 | 512 | 26624 | 8.456 | 242.20 | 34.201 | 14.97 |
| 2048 | 512 | 28672 | 8.563 | 239.18 | 34.049 | 15.04 |
| 2048 | 512 | 30720 | 8.641 | 237.00 | 34.399 | 14.88 |

RTX 5090 - 32k @ q8_0 (-ub 8192)

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 8192 | 2048 | 0 | 12.104 | 676.80 | 123.036 | 16.65 |
| 8192 | 2048 | 8192 | 13.802 | 593.52 | 131.376 | 15.59 |
| 8192 | 2048 | 16384 | 15.529 | 527.52 | 138.550 | 14.78 |
| 8192 | 2048 | 24576 | 17.297 | 473.62 | 150.579 | 13.60 |

RTX 6000 - 32k @ q8_0 (-ub 16384, -ot "\.(2[6-9]|[3-9][0-9])\.ffn_(gate|up|down)_exps.=CPU")

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 16384 | 4096 | 0 | 16.896 | 969.68 | 219.313 | 18.68 |
| 16384 | 4096 | 16384 | 23.399 | 700.21 | 247.435 | 16.55 |
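
For reference, the -ot regex in this last run keeps the routed-expert tensors of layers 26-99 on the CPU, so only the experts of the first 26 layers are moved to VRAM. A quick way to sanity-check such a pattern against the usual blk.N GGUF tensor names (a sketch, assuming GLM-4.6's 92 layers; not taken from the thread):

```bash
# Count how many of the 92 expert layers the override pattern keeps on CPU
for i in $(seq 0 91); do echo "blk.$i.ffn_up_exps.weight"; done \
  | grep -Ec '\.(2[6-9]|[3-9][0-9])\.ffn_(gate|up|down)_exps\.'
# => 66 (layers 26-91 stay on CPU; layers 0-25 go to the GPU)
```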

RTX 6000 - 192k @ f16 (-ub 8192)

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 8192 | 2048 | 0 | 11.294 | 725.36 | 121.533 | 16.85 |
| 8192 | 2048 | 8192 | 12.679 | 646.12 | 125.605 | 16.31 |
| 8192 | 2048 | 16384 | 13.913 | 588.79 | 132.592 | 15.45 |
| 8192 | 2048 | 24576 | 15.433 | 530.81 | 138.509 | 14.79 |
| 8192 | 2048 | 32768 | 17.222 | 475.67 | 144.749 | 14.15 |
| 8192 | 2048 | 40960 | 19.431 | 421.58 | 151.219 | 13.54 |
| 8192 | 2048 | 49152 | 21.785 | 376.03 | 157.030 | 13.04 |
| 8192 | 2048 | 57344 | 24.127 | 339.54 | 163.342 | 12.54 |
| 8192 | 2048 | 65536 | 26.443 | 309.80 | 171.141 | 11.97 |
| 8192 | 2048 | 73728 | 28.571 | 286.73 | 186.935 | 10.96 |
| 8192 | 2048 | 81920 | 30.809 | 265.90 | 218.675 | 9.37 |
| 8192 | 2048 | 90112 | 33.121 | 247.33 | 245.795 | 8.33 |
| 8192 | 2048 | 98304 | 35.498 | 230.78 | 260.610 | 7.86 |
| 8192 | 2048 | 106496 | 38.019 | 215.47 | 273.867 | 7.48 |
| 8192 | 2048 | 114688 | 40.699 | 201.28 | 285.800 | 7.17 |
| 8192 | 2048 | 122880 | 43.860 | 186.78 | 297.138 | 6.89 |
| 8192 | 2048 | 131072 | 46.158 | 177.48 | 308.663 | 6.64 |
| 8192 | 2048 | 139264 | 47.971 | 170.77 | 319.498 | 6.41 |
| 8192 | 2048 | 147456 | 50.163 | 163.31 | 331.344 | 6.18 |
| 8192 | 2048 | 155648 | 52.666 | 155.55 | 343.088 | 5.97 |
| 8192 | 2048 | 163840 | 55.144 | 148.56 | 354.767 | 5.77 |
| 8192 | 2048 | 172032 | 57.282 | 143.01 | 365.980 | 5.60 |
| 8192 | 2048 | 180224 | 59.690 | 137.24 | 380.191 | 5.39 |
| 8192 | 2048 | 188416 | 62.231 | 131.64 | 389.775 | 5.25 |

I could squeeze out slightly better results by using more threads or lifting the power limits, but it's not really worth the extra heat and noise.
Any feedback, configuration advice, or questions are very welcome!

Thanks for sharing your results! The command looks reasonable, and it's cool you can hit a batch size of 8192 (some folks have trouble with larger batch sizes on their rigs).

> CPU power capped at 280 W, GPUs at 450 W

The only tip I can think of off-hand is to consider not power-capping your GPUs with nvidia-smi -pl 450 or whatever, and instead using the LACT "indirect undervolt" method (assuming Linux), or something like MSI Afterburner on Windoze.

On the Beaver AI Discord, a guy with an RTX 6000 PRO said that by locking the boost clock and setting a 1000 MHz offset (the offset seems to be scaled differently on Blackwell vs older-arch GPUs), the card runs at the same performance and never goes much over 300 W anyway. I have a post about it on Reddit and Level1Techs, and there is a good GitHub issue thread describing it on the LACT repo. It works on headless systems as well and can be persisted across boots. @Panchovix helped spread this method and explain how Blackwell GPUs use something like 10x the offset of earlier models for some reason.

https://github.com/ilya-zlobintsev/LACT/issues/486#issuecomment-3303917115
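
If you want to try the clock-locking half of this with stock NVIDIA tooling before setting up LACT (the offset itself is what LACT or Afterburner handle), here is a minimal sketch, with the GPU index and clock value as placeholders:

```bash
# Pin the GPU core clock to a fixed frequency instead of letting it boost
sudo nvidia-smi -i 0 --lock-gpu-clocks=1800,1800
# Revert to default boost behaviour
sudo nvidia-smi -i 0 --reset-gpu-clocks
```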

Thank you @ubergarm for the tip about LACT, I'll be sure to check it out!

I started power-limiting when some people here stressed me out about my 2200 W PSU, claiming it is not strong enough for 2x CPU, 2x GPU, 4x HDD, and 3x SSD 😀. Although I disagree, I decided that putting some power limits in place might not be a bad idea...

Yeah, higher -ub values work well on my machine; here I tested -ub 16384 to get over 1000 t/s PP.
What I find quite surprising is that offloading a few more expert layers has very little effect... but I guess it is just math - I'd need to offload many more to see a noticeable improvement.

This comment has been hidden (marked as Off-Topic)

Yeah, as @anikifoss points out on the linked thread:

> GLM-4.6 has 92 layers

So each additional routed-expert layer offloaded to VRAM isn't a large overall percentage, especially given only 8 or however many experts are active per token - it's very sparse.
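
A quick back-of-the-envelope check (assuming the routed-expert weights dominate the model size and are spread evenly across layers):

```bash
# With 92 layers, each additional expert layer moved to VRAM shifts
# only about 1/92 of the expert weights off the CPU
echo "scale=2; 100/92" | bc   # => 1.08 (percent per layer)
```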

You can still add a max power cap in addition to LACT if you're really worried about tripping the power supply, but yeah, play around and get your GPUs tuned up one at a time, then save the config into /etc/lact/config.yaml and you'll be gucci!
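
For that extra cap, the same nvidia-smi power-limit call mentioned above works as a backstop (GPU index and wattage are examples):

```bash
# Enable persistence mode so the setting survives the driver idling
sudo nvidia-smi -pm 1
# Hard power cap on top of the LACT undervolt profile
sudo nvidia-smi -i 0 -pl 450
```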

No idea what this was about - it already said this by the time I read it:

> This comment has been hidden (marked as Off-Topic)

> No idea what this was about (...)

HF is acting strange lately... but this was my mistake ;)

> Here are some performance tests of the IQ5_K version running on EPYC 9355, RTX 5090, and RTX Pro 6000. [...]

This is really neat! How much RAM do you have? I assume not everything fits in the 96 GB of VRAM.
